subject:"Re\:"


Okay, that proves I was wrong on the client side bottleneck.

On 24/04/2024 17:55, Nathan Marz wrote:
I tried running two client processes in parallel and the numbers were 
unchanged. The max throughput is still a single client doing 10 
in-flight BatchStatement containing 100 inserts.


On Tue, Apr 23, 2024 at 10:24 PM Bowen Song via user 
 wrote:


You might have run into the bottleneck of the driver's IO thread.
Try increase the driver's connections-per-server limit to 2 or 3
if you've only got 1 server in the cluster. Or alternatively, run
two client processes in parallel.


On 24/04/2024 07:19, Nathan Marz wrote:

Tried it again with one more client thread, and that had no
effect on performance. This is unsurprising as there's only 2 CPU
on this node and they were already at 100%. These were good
ideas, but I'm still unable to even match the performance of
batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user
 wrote:

To achieve 10k loop iterations per second, each iteration
must take 0.1 milliseconds or less. Considering that each
iteration needs to lock and unlock the semaphore (two
syscalls) and make network requests (more syscalls), that's a
lots of context switches. It may a bit too much to ask for a
single thread. I would suggest try multi-threading or
multi-processing, and see if the combined insert rate is higher.

I should also note that executeAsync() also has implicit
limits on the number of in-flight requests, which default to
1024 requests per connection and 1 connection per server. See

https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/


On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple
threads? Using the exact same approach I'm able to get 38k /
second with periodic commitlog_sync. For what it's worth, I
do see 100% CPU utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user
 wrote:

Have you checked the thread CPU utilisation of the
client side? You likely will need more than one thread
to do insertion in a loop to achieve tens of thousands
of inserts per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual
inserts at a time with the same loop + semaphore
approach. This only nets 9k / second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more
individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch
statements should only be used where atomicity or
isolation is needed. Using batch statements won't
make inserting multiple partitions faster. In fact,
it often will make that slower.

Also, the liner relationship between
commitlog_sync_group_window and write throughput is
expected. That's because the max number of
uncompleted writes is limited by the write
concurrency, and a write is not considered
"complete" before it is synced to disk when
commitlog sync is in group or batch mode. That
means within each interval, only limited number of
writes can be done. The ways to increase that
including: add more nodes, sync commitlog at
shorter intervals and allow more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This
causes a single execute of a BatchStatement
containing 100 inserts to succeed. However, the
throughput I'm seeing is atrocious.

With these settings, I'm executing 10
BatchStatement concurrently at a time using the
semaphore + loop approach I showed in my first
message. So as requests complete, more are sent
out such that there are 10 in-flight at a time.
Each BatchStatement has 100 individual inserts.
I'm seeing only 730 inserts / second. Again, with
periodic mode I see 38k / second and with batch I
see 14k / second. My expectation was that group
commit mode throughput would be somewhere between

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Nathan Marz

I tried running two client processes in parallel and the numbers were
unchanged. The max throughput is still a single client doing 10 in-flight
BatchStatement containing 100 inserts.

On Tue, Apr 23, 2024 at 10:24 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> You might have run into the bottleneck of the driver's IO thread. Try
> increase the driver's connections-per-server limit to 2 or 3 if you've only
> got 1 server in the cluster. Or alternatively, run two client processes in
> parallel.
>
>
> On 24/04/2024 07:19, Nathan Marz wrote:
>
> Tried it again with one more client thread, and that had no effect on
> performance. This is unsurprising as there's only 2 CPU on this node and
> they were already at 100%. These were good ideas, but I'm still unable to
> even match the performance of batch commit mode with group commit mode.
>
> On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> To achieve 10k loop iterations per second, each iteration must take 0.1
>> milliseconds or less. Considering that each iteration needs to lock and
>> unlock the semaphore (two syscalls) and make network requests (more
>> syscalls), that's a lots of context switches. It may a bit too much to ask
>> for a single thread. I would suggest try multi-threading or
>> multi-processing, and see if the combined insert rate is higher.
>>
>> I should also note that executeAsync() also has implicit limits on the
>> number of in-flight requests, which default to 1024 requests per connection
>> and 1 connection per server. See
>> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
>>
>>
>> On 23/04/2024 23:18, Nathan Marz wrote:
>>
>> It's using the async API, so why would it need multiple threads? Using
>> the exact same approach I'm able to get 38k / second with periodic
>> commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
>> single one of these tests.
>>
>> On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Have you checked the thread CPU utilisation of the client side? You
>>> likely will need more than one thread to do insertion in a loop to achieve
>>> tens of thousands of inserts per second.
>>>
>>>
>>> On 23/04/2024 21:55, Nathan Marz wrote:
>>>
>>> Thanks for the explanation.
>>>
>>> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
>>> at 512, and doing 1000 individual inserts at a time with the same loop +
>>> semaphore approach. This only nets 9k / second.
>>>
>>> I got much higher throughput for the other modes with BatchStatement of
>>> 100 inserts rather than 100x more individual inserts.
>>>
>>> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 I suspect you are abusing batch statements. Batch statements should
 only be used where atomicity or isolation is needed. Using batch statements
 won't make inserting multiple partitions faster. In fact, it often will
 make that slower.

 Also, the liner relationship between commitlog_sync_group_window and
 write throughput is expected. That's because the max number of uncompleted
 writes is limited by the write concurrency, and a write is not considered
 "complete" before it is synced to disk when commitlog sync is in group or
 batch mode. That means within each interval, only limited number of writes
 can be done. The ways to increase that including: add more nodes, sync
 commitlog at shorter intervals and allow more concurrent writes.

 On 23/04/2024 20:43, Nathan Marz wrote:

 Thanks. I raised concurrent_writes to 128 and
 set commitlog_sync_group_window to 20ms. This causes a single execute of a
 BatchStatement containing 100 inserts to succeed. However, the throughput
 I'm seeing is atrocious.

 With these settings, I'm executing 10 BatchStatement concurrently at a
 time using the semaphore + loop approach I showed in my first message. So
 as requests complete, more are sent out such that there are 10 in-flight at
 a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
 inserts / second. Again, with periodic mode I see 38k / second and with
 batch I see 14k / second. My expectation was that group commit mode
 throughput would be somewhere between those two.

 If I set commitlog_sync_group_window to 100ms, the throughput drops to
 14 / second.

 If I set commitlog_sync_group_window to 10ms, the throughput increases
 to 1587 / second.

 If I set commitlog_sync_group_window to 5ms, the throughput increases
 to 3200 / second.

 If I set commitlog_sync_group_window to 1ms, the throughput increases
 to 13k / second, which is slightly less than batch commit mode.

 Is group commit mode supposed to have better performance than batch
 mode?

 On Tue, Apr 23,

Re: Mixed Cluster 4.0 and 4.1


Hi Paul,

IMO, if they are truly risk-adverse, they should follow the tested and 
proven best practices, instead of doing things in a less tested way 
which is also know to pose a danger to the data correctness.


If they must do this over a long period of time, then they may need to 
temporarily increase the gc_grace_seconds on all tables, and ensure that 
no DDL or repair is run before the upgrade completes. It is unknown 
whether this route is safe, because it's a less tested route to upgrade 
a cluster.


Please be aware that if they do deletes frequently, increasing the 
gc_grace_seconds may cause some reads to fail due to the elevated number 
of tombstones.


Cheers,
Bowen

On 24/04/2024 17:25, Paul Chandler wrote:

Hi Bowen,

Thanks for your quick reply.

Sorry I used the wrong term there, there it is a maintenance window rather than 
an outage. This is a key system and the vital nature of it means that the 
customer is rightly very risk adverse, so we will only even get permission to 
upgrade one DC per night via a rolling upgrade, meaning this will always be 
over more than a week.

So we can’t shorten the time the cluster is in mixed mode, but I am concerned 
about having a schema mismatch for this long time. Should I be concerned, or 
have others upgraded in a similar way?

Thanks

Paul


On 24 Apr 2024, at 17:02, Bowen Song via user  wrote:

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, which 
is the preferred route. It isn't advisable to take down an entire DC to do 
upgrade.

You should aim to complete upgrading the entire cluster and finish a full 
repair within the shortest gc_grace_seconds (default to 10 days) of all tables. 
Failing to do that may cause data resurrections.

During the rolling upgrade, you should not run repair or any DDL query (such as 
ALTER TABLE, TRUNCATE, etc.).

You don't need to do the rolling upgrade node by node. You can do it rack by 
rack. Stopping all nodes in a single rack and upgrade them concurrently is much 
faster. The number of nodes doesn't matter that much to the time required to 
complete a rolling upgrade, it's the number of DCs and racks matter.

Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or 2.

We have noticed that during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added into the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea, is 
there any other issues to look out for?

Thanks

Paul Chandler

Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Paul Chandler

Hi Bowen,

Thanks for your quick reply. 

Sorry I used the wrong term there, there it is a maintenance window rather than 
an outage. This is a key system and the vital nature of it means that the 
customer is rightly very risk adverse, so we will only even get permission to 
upgrade one DC per night via a rolling upgrade, meaning this will always be 
over more than a week. 

So we can’t shorten the time the cluster is in mixed mode, but I am concerned 
about having a schema mismatch for this long time. Should I be concerned, or 
have others upgraded in a similar way?

Thanks

Paul

> On 24 Apr 2024, at 17:02, Bowen Song via user  
> wrote:
> 
> Hi Paul,
> 
> You don't need to plan for or introduce an outage for a rolling upgrade, 
> which is the preferred route. It isn't advisable to take down an entire DC to 
> do upgrade.
> 
> You should aim to complete upgrading the entire cluster and finish a full 
> repair within the shortest gc_grace_seconds (default to 10 days) of all 
> tables. Failing to do that may cause data resurrections.
> 
> During the rolling upgrade, you should not run repair or any DDL query (such 
> as ALTER TABLE, TRUNCATE, etc.).
> 
> You don't need to do the rolling upgrade node by node. You can do it rack by 
> rack. Stopping all nodes in a single rack and upgrade them concurrently is 
> much faster. The number of nodes doesn't matter that much to the time 
> required to complete a rolling upgrade, it's the number of DCs and racks 
> matter.
> 
> Cheers,
> Bowen
> 
> On 24/04/2024 16:16, Paul Chandler wrote:
>> Hi all,
>> 
>> We have some large clusters ( 1000+  nodes ), these are across multiple 
>> datacenters.
>> 
>> When we perform upgrades we would normally upgrade a DC at a time during a 
>> planned outage for one DC. This means that a cluster might be in a mixed 
>> mode with multiple versions for a week or 2.
>> 
>> We have noticed that during our testing that upgrading to 4.1 causes a 
>> schema mismatch due to the new tables added into the system keyspace.
>> 
>> Is this going to be an issue if this schema mismatch lasts for maybe several 
>> weeks? I assume that running any DDL during that time would be a bad idea, 
>> is there any other issues to look out for?
>> 
>> Thanks
>> 
>> Paul Chandler

Re: Mixed Cluster 4.0 and 4.1


Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, 
which is the preferred route. It isn't advisable to take down an entire 
DC to do upgrade.


You should aim to complete upgrading the entire cluster and finish a 
full repair within the shortest gc_grace_seconds (default to 10 days) of 
all tables. Failing to do that may cause data resurrections.


During the rolling upgrade, you should not run repair or any DDL query 
(such as ALTER TABLE, TRUNCATE, etc.).


You don't need to do the rolling upgrade node by node. You can do it 
rack by rack. Stopping all nodes in a single rack and upgrade them 
concurrently is much faster. The number of nodes doesn't matter that 
much to the time required to complete a rolling upgrade, it's the number 
of DCs and racks matter.


Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or 2.

We have noticed that during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added into the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea, is 
there any other issues to look out for?

Thanks

Paul Chandler

Re: Trouble with using group commitlog_sync

You might have run into the bottleneck of the driver's IO thread. Try 
increase the driver's connections-per-server limit to 2 or 3 if you've 
only got 1 server in the cluster. Or alternatively, run two client 
processes in parallel.



On 24/04/2024 07:19, Nathan Marz wrote:
Tried it again with one more client thread, and that had no effect on 
performance. This is unsurprising as there's only 2 CPU on this node 
and they were already at 100%. These were good ideas, but I'm still 
unable to even match the performance of batch commit mode with group 
commit mode.


On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user 
 wrote:


To achieve 10k loop iterations per second, each iteration must
take 0.1 milliseconds or less. Considering that each iteration
needs to lock and unlock the semaphore (two syscalls) and make
network requests (more syscalls), that's a lots of context
switches. It may a bit too much to ask for a single thread. I
would suggest try multi-threading or multi-processing, and see if
the combined insert rate is higher.

I should also note that executeAsync() also has implicit limits on
the number of in-flight requests, which default to 1024 requests
per connection and 1 connection per server. See
https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/


On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple threads?
Using the exact same approach I'm able to get 38k / second with
periodic commitlog_sync. For what it's worth, I do see 100% CPU
utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user
 wrote:

Have you checked the thread CPU utilisation of the client
side? You likely will need more than one thread to do
insertion in a loop to achieve tens of thousands of inserts
per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual inserts
at a time with the same loop + semaphore approach. This only
nets 9k / second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more
individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch
statements should only be used where atomicity or
isolation is needed. Using batch statements won't make
inserting multiple partitions faster. In fact, it often
will make that slower.

Also, the liner relationship between
commitlog_sync_group_window and write throughput is
expected. That's because the max number of uncompleted
writes is limited by the write concurrency, and a write
is not considered "complete" before it is synced to disk
when commitlog sync is in group or batch mode. That
means within each interval, only limited number of
writes can be done. The ways to increase that including:
add more nodes, sync commitlog at shorter intervals and
allow more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a
single execute of a BatchStatement containing 100
inserts to succeed. However, the throughput I'm seeing
is atrocious.

With these settings, I'm executing 10 BatchStatement
concurrently at a time using the semaphore + loop
approach I showed in my first message. So as requests
complete, more are sent out such that there are 10
in-flight at a time. Each BatchStatement has 100
individual inserts. I'm seeing only 730 inserts /
second. Again, with periodic mode I see 38k / second
and with batch I see 14k / second. My expectation was
that group commit mode throughput would be somewhere
between those two.

If I set commitlog_sync_group_window to 100ms, the
throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the
throughput increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the
throughput increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the
throughput increases to 13k / second, which is slightly
less than batch commit mode.

Is group commit mode supposed to have better
performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Nathan Marz

Tried it again with one more client thread, and that had no effect on
performance. This is unsurprising as there's only 2 CPU on this node and
they were already at 100%. These were good ideas, but I'm still unable to
even match the performance of batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> To achieve 10k loop iterations per second, each iteration must take 0.1
> milliseconds or less. Considering that each iteration needs to lock and
> unlock the semaphore (two syscalls) and make network requests (more
> syscalls), that's a lots of context switches. It may a bit too much to ask
> for a single thread. I would suggest try multi-threading or
> multi-processing, and see if the combined insert rate is higher.
>
> I should also note that executeAsync() also has implicit limits on the
> number of in-flight requests, which default to 1024 requests per connection
> and 1 connection per server. See
> https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
>
>
> On 23/04/2024 23:18, Nathan Marz wrote:
>
> It's using the async API, so why would it need multiple threads? Using the
> exact same approach I'm able to get 38k / second with periodic
> commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
> single one of these tests.
>
> On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Have you checked the thread CPU utilisation of the client side? You
>> likely will need more than one thread to do insertion in a loop to achieve
>> tens of thousands of inserts per second.
>>
>>
>> On 23/04/2024 21:55, Nathan Marz wrote:
>>
>> Thanks for the explanation.
>>
>> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
>> at 512, and doing 1000 individual inserts at a time with the same loop +
>> semaphore approach. This only nets 9k / second.
>>
>> I got much higher throughput for the other modes with BatchStatement of
>> 100 inserts rather than 100x more individual inserts.
>>
>> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> I suspect you are abusing batch statements. Batch statements should only
>>> be used where atomicity or isolation is needed. Using batch statements
>>> won't make inserting multiple partitions faster. In fact, it often will
>>> make that slower.
>>>
>>> Also, the liner relationship between commitlog_sync_group_window and
>>> write throughput is expected. That's because the max number of uncompleted
>>> writes is limited by the write concurrency, and a write is not considered
>>> "complete" before it is synced to disk when commitlog sync is in group or
>>> batch mode. That means within each interval, only limited number of writes
>>> can be done. The ways to increase that including: add more nodes, sync
>>> commitlog at shorter intervals and allow more concurrent writes.
>>>
>>>
>>> On 23/04/2024 20:43, Nathan Marz wrote:
>>>
>>> Thanks. I raised concurrent_writes to 128 and
>>> set commitlog_sync_group_window to 20ms. This causes a single execute of a
>>> BatchStatement containing 100 inserts to succeed. However, the throughput
>>> I'm seeing is atrocious.
>>>
>>> With these settings, I'm executing 10 BatchStatement concurrently at a
>>> time using the semaphore + loop approach I showed in my first message. So
>>> as requests complete, more are sent out such that there are 10 in-flight at
>>> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
>>> inserts / second. Again, with periodic mode I see 38k / second and with
>>> batch I see 14k / second. My expectation was that group commit mode
>>> throughput would be somewhere between those two.
>>>
>>> If I set commitlog_sync_group_window to 100ms, the throughput drops to
>>> 14 / second.
>>>
>>> If I set commitlog_sync_group_window to 10ms, the throughput increases
>>> to 1587 / second.
>>>
>>> If I set commitlog_sync_group_window to 5ms, the throughput increases to
>>> 3200 / second.
>>>
>>> If I set commitlog_sync_group_window to 1ms, the throughput increases to
>>> 13k / second, which is slightly less than batch commit mode.
>>>
>>> Is group commit mode supposed to have better performance than batch mode?
>>>
>>>
>>> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 The default commitlog_sync_group_window is very long for SSDs. Try
 reduce it if you are using SSD-backed storage for the commit log. 10-15 ms
 is a good starting point. You may also want to increase the value of
 concurrent_writes, consider at least double or quadruple it from the
 default. You'll need even higher write concurrency for longer
 commitlog_sync_group_window.

 On 23/04/2024 19:26, Nathan Marz wrote:

 "batch" mode works fine. I'm having trouble with "group" mode. The only
 config for that is "commitlog_sync_group_window",

Re: Trouble with using group commitlog_sync

To achieve 10k loop iterations per second, each iteration must take 0.1 
milliseconds or less. Considering that each iteration needs to lock and 
unlock the semaphore (two syscalls) and make network requests (more 
syscalls), that's a lots of context switches. It may a bit too much to 
ask for a single thread. I would suggest try multi-threading or 
multi-processing, and see if the combined insert rate is higher.


I should also note that executeAsync() also has implicit limits on the 
number of in-flight requests, which default to 1024 requests per 
connection and 1 connection per server. See 
https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/



On 23/04/2024 23:18, Nathan Marz wrote:
It's using the async API, so why would it need multiple threads? Using 
the exact same approach I'm able to get 38k / second with periodic 
commitlog_sync. For what it's worth, I do see 100% CPU utilization in 
every single one of these tests.


On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user 
 wrote:


Have you checked the thread CPU utilisation of the client side?
You likely will need more than one thread to do insertion in a
loop to achieve tens of thousands of inserts per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual inserts at a
time with the same loop + semaphore approach. This only nets 9k /
second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more individual
inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed.
Using batch statements won't make inserting multiple
partitions faster. In fact, it often will make that slower.

Also, the liner relationship between
commitlog_sync_group_window and write throughput is expected.
That's because the max number of uncompleted writes is
limited by the write concurrency, and a write is not
considered "complete" before it is synced to disk when
commitlog sync is in group or batch mode. That means within
each interval, only limited number of writes can be done. The
ways to increase that including: add more nodes, sync
commitlog at shorter intervals and allow more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a
single execute of a BatchStatement containing 100 inserts to
succeed. However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatement
concurrently at a time using the semaphore + loop approach I
showed in my first message. So as requests complete, more
are sent out such that there are 10 in-flight at a time.
Each BatchStatement has 100 individual inserts. I'm seeing
only 730 inserts / second. Again, with periodic mode I see
38k / second and with batch I see 14k / second. My
expectation was that group commit mode throughput would be
somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the
throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance
than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reduce it if you are using SSD-backed storage
for the commit log. 10-15 ms is a good starting point.
You may also want to increase the value of
concurrent_writes, consider at least double or quadruple
it from the default. You'll need even higher write
concurrency for longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with
"group" mode. The only config for that is
"commitlog_sync_group_window", and I have that set to
the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set
commitlog_sync_batch_window to 1 second long when

Re: Trouble with using group commitlog_sync

It's using the async API, so why would it need multiple threads? Using the
exact same approach I'm able to get 38k / second with periodic
commitlog_sync. For what it's worth, I do see 100% CPU utilization in every
single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Have you checked the thread CPU utilisation of the client side? You likely
> will need more than one thread to do insertion in a loop to achieve tens of
> thousands of inserts per second.
>
>
> On 23/04/2024 21:55, Nathan Marz wrote:
>
> Thanks for the explanation.
>
> I tried again with commitlog_sync_group_window at 2ms, concurrent_writes
> at 512, and doing 1000 individual inserts at a time with the same loop +
> semaphore approach. This only nets 9k / second.
>
> I got much higher throughput for the other modes with BatchStatement of
> 100 inserts rather than 100x more individual inserts.
>
> On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> I suspect you are abusing batch statements. Batch statements should only
>> be used where atomicity or isolation is needed. Using batch statements
>> won't make inserting multiple partitions faster. In fact, it often will
>> make that slower.
>>
>> Also, the liner relationship between commitlog_sync_group_window and
>> write throughput is expected. That's because the max number of uncompleted
>> writes is limited by the write concurrency, and a write is not considered
>> "complete" before it is synced to disk when commitlog sync is in group or
>> batch mode. That means within each interval, only limited number of writes
>> can be done. The ways to increase that including: add more nodes, sync
>> commitlog at shorter intervals and allow more concurrent writes.
>>
>>
>> On 23/04/2024 20:43, Nathan Marz wrote:
>>
>> Thanks. I raised concurrent_writes to 128 and
>> set commitlog_sync_group_window to 20ms. This causes a single execute of a
>> BatchStatement containing 100 inserts to succeed. However, the throughput
>> I'm seeing is atrocious.
>>
>> With these settings, I'm executing 10 BatchStatement concurrently at a
>> time using the semaphore + loop approach I showed in my first message. So
>> as requests complete, more are sent out such that there are 10 in-flight at
>> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
>> inserts / second. Again, with periodic mode I see 38k / second and with
>> batch I see 14k / second. My expectation was that group commit mode
>> throughput would be somewhere between those two.
>>
>> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14
>> / second.
>>
>> If I set commitlog_sync_group_window to 10ms, the throughput increases to
>> 1587 / second.
>>
>> If I set commitlog_sync_group_window to 5ms, the throughput increases to
>> 3200 / second.
>>
>> If I set commitlog_sync_group_window to 1ms, the throughput increases to
>> 13k / second, which is slightly less than batch commit mode.
>>
>> Is group commit mode supposed to have better performance than batch mode?
>>
>>
>> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> The default commitlog_sync_group_window is very long for SSDs. Try
>>> reduce it if you are using SSD-backed storage for the commit log. 10-15 ms
>>> is a good starting point. You may also want to increase the value of
>>> concurrent_writes, consider at least double or quadruple it from the
>>> default. You'll need even higher write concurrency for longer
>>> commitlog_sync_group_window.
>>>
>>> On 23/04/2024 19:26, Nathan Marz wrote:
>>>
>>> "batch" mode works fine. I'm having trouble with "group" mode. The only
>>> config for that is "commitlog_sync_group_window", and I have that set to
>>> the default 1000ms.
>>>
>>> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
>>> user@cassandra.apache.org> wrote:
>>>
 Why would you want to set commitlog_sync_batch_window to 1 second long
 when commitlog_sync is set to batch mode? The documentation
 
 on this says:

 *This window should be kept short because the writer threads will be
 unable to do extra work while waiting. You may need to increase
 concurrent_writes for the same reason*

 If you want to use batch mode, at least ensure
 commitlog_sync_batch_window is reasonably short. The default is 2
 millisecond.


 On 23/04/2024 18:32, Nathan Marz wrote:

 I'm doing some benchmarking of Cassandra on a single m6gd.large
 instance. It works fine with periodic or batch commitlog_sync options, but
 I'm having tons of issues when I change it to "group". I have
 "commitlog_sync_group_window" set to 1000ms.

 My client is doing writes like this (pseudocode):

 Semaphore sem = new Semaphore(numTickets);
 while(true) {

Re: Trouble with using group commitlog_sync

Have you checked the thread CPU utilisation of the client side? You 
likely will need more than one thread to do insertion in a loop to 
achieve tens of thousands of inserts per second.



On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, 
concurrent_writes at 512, and doing 1000 individual inserts at a time 
with the same loop + semaphore approach. This only nets 9k / second.


I got much higher throughput for the other modes with BatchStatement 
of 100 inserts rather than 100x more individual inserts.


On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user 
 wrote:


I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed. Using
batch statements won't make inserting multiple partitions faster.
In fact, it often will make that slower.

Also, the liner relationship between commitlog_sync_group_window
and write throughput is expected. That's because the max number of
uncompleted writes is limited by the write concurrency, and a
write is not considered "complete" before it is synced to disk
when commitlog sync is in group or batch mode. That means within
each interval, only limited number of writes can be done. The ways
to increase that including: add more nodes, sync commitlog at
shorter intervals and allow more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a single
execute of a BatchStatement containing 100 inserts to succeed.
However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatement concurrently
at a time using the semaphore + loop approach I showed in my
first message. So as requests complete, more are sent out such
that there are 10 in-flight at a time. Each BatchStatement has
100 individual inserts. I'm seeing only 730 inserts / second.
Again, with periodic mode I see 38k / second and with batch I see
14k / second. My expectation was that group commit mode
throughput would be somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the throughput
drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance than
batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reduce it if you are using SSD-backed storage for
the commit log. 10-15 ms is a good starting point. You may
also want to increase the value of concurrent_writes,
consider at least double or quadruple it from the default.
You'll need even higher write concurrency for longer
commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group"
mode. The only config for that is
"commitlog_sync_group_window", and I have that set to the
default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to
1 second long when commitlog_sync is set to batch mode?
The documentation


on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while
waiting. You may need to increase concurrent_writes
for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The
default is 2 millisecond.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or
batch commitlog_sync options, but I'm having tons of
issues when I change it to "group". I have
"commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(),
genUUIDStr(), genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}

Re: Trouble with using group commitlog_sync

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, concurrent_writes at
512, and doing 1000 individual inserts at a time with the same loop +
semaphore approach. This only nets 9k / second.

I got much higher throughput for the other modes with BatchStatement of 100
inserts rather than 100x more individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> I suspect you are abusing batch statements. Batch statements should only
> be used where atomicity or isolation is needed. Using batch statements
> won't make inserting multiple partitions faster. In fact, it often will
> make that slower.
>
> Also, the liner relationship between commitlog_sync_group_window and
> write throughput is expected. That's because the max number of uncompleted
> writes is limited by the write concurrency, and a write is not considered
> "complete" before it is synced to disk when commitlog sync is in group or
> batch mode. That means within each interval, only limited number of writes
> can be done. The ways to increase that including: add more nodes, sync
> commitlog at shorter intervals and allow more concurrent writes.
>
>
> On 23/04/2024 20:43, Nathan Marz wrote:
>
> Thanks. I raised concurrent_writes to 128 and
> set commitlog_sync_group_window to 20ms. This causes a single execute of a
> BatchStatement containing 100 inserts to succeed. However, the throughput
> I'm seeing is atrocious.
>
> With these settings, I'm executing 10 BatchStatement concurrently at a
> time using the semaphore + loop approach I showed in my first message. So
> as requests complete, more are sent out such that there are 10 in-flight at
> a time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
> inserts / second. Again, with periodic mode I see 38k / second and with
> batch I see 14k / second. My expectation was that group commit mode
> throughput would be somewhere between those two.
>
> If I set commitlog_sync_group_window to 100ms, the throughput drops to 14
> / second.
>
> If I set commitlog_sync_group_window to 10ms, the throughput increases to
> 1587 / second.
>
> If I set commitlog_sync_group_window to 5ms, the throughput increases to
> 3200 / second.
>
> If I set commitlog_sync_group_window to 1ms, the throughput increases to
> 13k / second, which is slightly less than batch commit mode.
>
> Is group commit mode supposed to have better performance than batch mode?
>
>
> On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> The default commitlog_sync_group_window is very long for SSDs. Try
>> reduce it if you are using SSD-backed storage for the commit log. 10-15 ms
>> is a good starting point. You may also want to increase the value of
>> concurrent_writes, consider at least double or quadruple it from the
>> default. You'll need even higher write concurrency for longer
>> commitlog_sync_group_window.
>>
>> On 23/04/2024 19:26, Nathan Marz wrote:
>>
>> "batch" mode works fine. I'm having trouble with "group" mode. The only
>> config for that is "commitlog_sync_group_window", and I have that set to
>> the default 1000ms.
>>
>> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
>> user@cassandra.apache.org> wrote:
>>
>>> Why would you want to set commitlog_sync_batch_window to 1 second long
>>> when commitlog_sync is set to batch mode? The documentation
>>> 
>>> on this says:
>>>
>>> *This window should be kept short because the writer threads will be
>>> unable to do extra work while waiting. You may need to increase
>>> concurrent_writes for the same reason*
>>>
>>> If you want to use batch mode, at least ensure
>>> commitlog_sync_batch_window is reasonably short. The default is 2
>>> millisecond.
>>>
>>>
>>> On 23/04/2024 18:32, Nathan Marz wrote:
>>>
>>> I'm doing some benchmarking of Cassandra on a single m6gd.large
>>> instance. It works fine with periodic or batch commitlog_sync options, but
>>> I'm having tons of issues when I change it to "group". I have
>>> "commitlog_sync_group_window" set to 1000ms.
>>>
>>> My client is doing writes like this (pseudocode):
>>>
>>> Semaphore sem = new Semaphore(numTickets);
>>> while(true) {
>>>
>>> sem.acquire();
>>> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
>>> genUUIDStr())
>>> .whenComplete((t, u) -> sem.release())
>>>
>>> }
>>>
>>> If I set numTickets higher than 20, I get tons of timeout errors.
>>>
>>> I've also tried doing single commands with BatchStatement with many
>>> inserts at a time, and that fails with timeout when the batch size gets
>>> more than 20.
>>>
>>> Increasing the write request timeout in cassandra.yaml makes it time out
>>> at slightly higher numbers of concurrent requests.
>>>
>>> With periodic I'm able to get about 38k writes / second, and with batch
>>> I'm able to get about 14k / second.
>>>
>>> Any tips on

Re: Trouble with using group commitlog_sync

I suspect you are abusing batch statements. Batch statements should only 
be used where atomicity or isolation is needed. Using batch statements 
won't make inserting multiple partitions faster. In fact, it often will 
make that slower.


Also, the liner relationship between commitlog_sync_group_window and 
write throughput is expected. That's because the max number of 
uncompleted writes is limited by the write concurrency, and a write is 
not considered "complete" before it is synced to disk when commitlog 
sync is in group or batch mode. That means within each interval, only 
limited number of writes can be done. The ways to increase that 
including: add more nodes, sync commitlog at shorter intervals and allow 
more concurrent writes.



On 23/04/2024 20:43, Nathan Marz wrote:
Thanks. I raised concurrent_writes to 128 and 
set commitlog_sync_group_window to 20ms. This causes a single execute 
of a BatchStatement containing 100 inserts to succeed. However, the 
throughput I'm seeing is atrocious.


With these settings, I'm executing 10 BatchStatement concurrently at a 
time using the semaphore + loop approach I showed in my first message. 
So as requests complete, more are sent out such that there are 10 
in-flight at a time. Each BatchStatement has 100 individual inserts. 
I'm seeing only 730 inserts / second. Again, with periodic mode I see 
38k / second and with batch I see 14k / second. My expectation was 
that group commit mode throughput would be somewhere between those two.


If I set commitlog_sync_group_window to 100ms, the throughput drops to 
14 / second.


If I set commitlog_sync_group_window to 10ms, the throughput increases 
to 1587 / second.


If I set commitlog_sync_group_window to 5ms, the throughput increases 
to 3200 / second.


If I set commitlog_sync_group_window to 1ms, the throughput increases 
to 13k / second, which is slightly less than batch commit mode.


Is group commit mode supposed to have better performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user 
 wrote:


The default commitlog_sync_group_window is very long for SSDs. Try
reduce it if you are using SSD-backed storage for the commit log.
10-15 ms is a good starting point. You may also want to increase
the value of concurrent_writes, consider at least double or
quadruple it from the default. You'll need even higher write
concurrency for longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group" mode.
The only config for that is "commitlog_sync_group_window", and I
have that set to the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to 1
second long when commitlog_sync is set to batch mode? The
documentation


on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while waiting.
You may need to increase concurrent_writes for the same
reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default
is 2 millisecond.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or batch
commitlog_sync options, but I'm having tons of issues when I
change it to "group". I have "commitlog_sync_group_window"
set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(),
genUUIDStr(), genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}

If I set numTickets higher than 20, I get tons of timeout
errors.

I've also tried doing single commands with BatchStatement
with many inserts at a time, and that fails with timeout
when the batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes
it time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and
with batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group
commitlog_sync to work properly? I didn't expect to have to
do anything other than change the config.

Re: Trouble with using group commitlog_sync

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a single execute of a
BatchStatement containing 100 inserts to succeed. However, the throughput
I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatement concurrently at a time
using the semaphore + loop approach I showed in my first message. So as
requests complete, more are sent out such that there are 10 in-flight at a
time. Each BatchStatement has 100 individual inserts. I'm seeing only 730
inserts / second. Again, with periodic mode I see 38k / second and with
batch I see 14k / second. My expectation was that group commit mode
throughput would be somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the throughput drops to 14 /
second.

If I set commitlog_sync_group_window to 10ms, the throughput increases to
1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput increases to
3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput increases to
13k / second, which is slightly less than batch commit mode.

Is group commit mode supposed to have better performance than batch mode?

On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> The default commitlog_sync_group_window is very long for SSDs. Try reduce
> it if you are using SSD-backed storage for the commit log. 10-15 ms is a
> good starting point. You may also want to increase the value of
> concurrent_writes, consider at least double or quadruple it from the
> default. You'll need even higher write concurrency for longer
> commitlog_sync_group_window.
>
> On 23/04/2024 19:26, Nathan Marz wrote:
>
> "batch" mode works fine. I'm having trouble with "group" mode. The only
> config for that is "commitlog_sync_group_window", and I have that set to
> the default 1000ms.
>
> On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Why would you want to set commitlog_sync_batch_window to 1 second long
>> when commitlog_sync is set to batch mode? The documentation
>> 
>> on this says:
>>
>> *This window should be kept short because the writer threads will be
>> unable to do extra work while waiting. You may need to increase
>> concurrent_writes for the same reason*
>>
>> If you want to use batch mode, at least ensure
>> commitlog_sync_batch_window is reasonably short. The default is 2
>> millisecond.
>>
>>
>> On 23/04/2024 18:32, Nathan Marz wrote:
>>
>> I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
>> It works fine with periodic or batch commitlog_sync options, but I'm having
>> tons of issues when I change it to "group". I have
>> "commitlog_sync_group_window" set to 1000ms.
>>
>> My client is doing writes like this (pseudocode):
>>
>> Semaphore sem = new Semaphore(numTickets);
>> while(true) {
>>
>> sem.acquire();
>> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr())
>> .whenComplete((t, u) -> sem.release())
>>
>> }
>>
>> If I set numTickets higher than 20, I get tons of timeout errors.
>>
>> I've also tried doing single commands with BatchStatement with many
>> inserts at a time, and that fails with timeout when the batch size gets
>> more than 20.
>>
>> Increasing the write request timeout in cassandra.yaml makes it time out
>> at slightly higher numbers of concurrent requests.
>>
>> With periodic I'm able to get about 38k writes / second, and with batch
>> I'm able to get about 14k / second.
>>
>> Any tips on what I should be doing to get group commitlog_sync to work
>> properly? I didn't expect to have to do anything other than change the
>> config.
>>
>>

Re: Trouble with using group commitlog_sync

The default commitlog_sync_group_window is very long for SSDs. Try 
reduce it if you are using SSD-backed storage for the commit log. 10-15 
ms is a good starting point. You may also want to increase the value of 
concurrent_writes, consider at least double or quadruple it from the 
default. You'll need even higher write concurrency for longer 
commitlog_sync_group_window.



On 23/04/2024 19:26, Nathan Marz wrote:
"batch" mode works fine. I'm having trouble with "group" mode. The 
only config for that is "commitlog_sync_group_window", and I have that 
set to the default 1000ms.


On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user 
 wrote:


Why would you want to set commitlog_sync_batch_window to 1 second
long when commitlog_sync is set to batch mode? The documentation


on this says:

/This window should be kept short because the writer threads
will be unable to do extra work while waiting. You may need to
increase concurrent_writes for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default is 2
millisecond.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single m6gd.large
instance. It works fine with periodic or batch commitlog_sync
options, but I'm having tons of issues when I change it to
"group". I have "commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with
many inserts at a time, and that fails with timeout when the
batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes it
time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and with
batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group commitlog_sync to
work properly? I didn't expect to have to do anything other than
change the config.

Re: Trouble with using group commitlog_sync

"batch" mode works fine. I'm having trouble with "group" mode. The only
config for that is "commitlog_sync_group_window", and I have that set to
the default 1000ms.

On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user <
user@cassandra.apache.org> wrote:

> Why would you want to set commitlog_sync_batch_window to 1 second long
> when commitlog_sync is set to batch mode? The documentation
> 
> on this says:
>
> *This window should be kept short because the writer threads will be
> unable to do extra work while waiting. You may need to increase
> concurrent_writes for the same reason*
>
> If you want to use batch mode, at least ensure commitlog_sync_batch_window
> is reasonably short. The default is 2 millisecond.
>
>
> On 23/04/2024 18:32, Nathan Marz wrote:
>
> I'm doing some benchmarking of Cassandra on a single m6gd.large instance.
> It works fine with periodic or batch commitlog_sync options, but I'm having
> tons of issues when I change it to "group". I have
> "commitlog_sync_group_window" set to 1000ms.
>
> My client is doing writes like this (pseudocode):
>
> Semaphore sem = new Semaphore(numTickets);
> while(true) {
>
> sem.acquire();
> session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr())
> .whenComplete((t, u) -> sem.release())
>
> }
>
> If I set numTickets higher than 20, I get tons of timeout errors.
>
> I've also tried doing single commands with BatchStatement with many
> inserts at a time, and that fails with timeout when the batch size gets
> more than 20.
>
> Increasing the write request timeout in cassandra.yaml makes it time out
> at slightly higher numbers of concurrent requests.
>
> With periodic I'm able to get about 38k writes / second, and with batch
> I'm able to get about 14k / second.
>
> Any tips on what I should be doing to get group commitlog_sync to work
> properly? I didn't expect to have to do anything other than change the
> config.
>
>

Re: Trouble with using group commitlog_sync

Why would you want to set commitlog_sync_batch_window to 1 second long 
when commitlog_sync is set to batch mode? The documentation 
 
on this says:


   /This window should be kept short because the writer threads will be
   unable to do extra work while waiting. You may need to increase
   concurrent_writes for the same reason/

If you want to use batch mode, at least ensure 
commitlog_sync_batch_window is reasonably short. The default is 2 
millisecond.



On 23/04/2024 18:32, Nathan Marz wrote:
I'm doing some benchmarking of Cassandra on a single m6gd.large 
instance. It works fine with periodic or batch commitlog_sync options, 
but I'm having tons of issues when I change it to "group". I have 
"commitlog_sync_group_window" set to 1000ms.


My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(),
genUUIDStr())
            .whenComplete((t, u) -> sem.release())

}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with many 
inserts at a time, and that fails with timeout when the batch size 
gets more than 20.


Increasing the write request timeout in cassandra.yaml makes it time 
out at slightly higher numbers of concurrent requests.


With periodic I'm able to get about 38k writes / second, and with 
batch I'm able to get about 14k / second.


Any tips on what I should be doing to get group commitlog_sync to work 
properly? I didn't expect to have to do anything other than change the 
config.

RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-23 Thread Michalis Kotsiouros (EXT) via user

Hello Alain,
Thanks a lot for the confirmation.
Yes this procedure seems like a workaround. But for my use case where 
system_auth contains a small amount of data and consistency level for 
authentication/authorization is switched to LOCAL_ONE, I think it is good 
enough.
I completely get that this could be improved since there might be requirements 
from other users that cannot be covered with the proposed procedure.

BR
MK
From: Alain Rodriguez 
Sent: April 22, 2024 18:27
To: user@cassandra.apache.org
Cc: Michalis Kotsiouros (EXT) 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

Hi Michalis,

It's been a while since I removed a DC for the last time, but I see there is 
now a protection to avoid accidentally leaving a DC without auth capability.

This was introduced in C* 4.1 through CASSANDRA-17478 
(https://issues.apache.org/jira/browse/CASSANDRA-17478).

The process of dropping a data center might have been overlooked while doing 
this work.

It's never correct for an operator to remove a DC from system_auth replication 
settings while there are currently nodes up in that DC.

I believe this assertion is not correct. As Jon and Jeff mentioned, usually we 
remove the replication before decommissioning any node in the case of removing 
an entire DC, for reasons exposed by Jeff. The existing documentation is also 
clear about this: 
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html
 and 
https://thelastpickle.com/blog/2019/02/26/data-center-switch.html<https://protect2.fireeye.com/v1/url?k=31323334-501d5122-313273af-45444731-3f5d78e47d9f728a=1=1b5f9bb8-e8af-49b9-9e2d-26622cb77bfc=https%3A%2F%2Fthelastpickle.com%2Fblog%2F2019%2F02%2F26%2Fdata-center-switch.html>.

Michalis, the solution you suggest seems to be the (good/only?) way to go, even 
though it looks like a workaround, not really "clean" and something we need to 
improve. It was also mentioned here: 
https://dba.stackexchange.com/questions/331732/not-a-able-to-decommission-the-old-datacenter#answer-334890.
 It should work quickly, but only because this keyspace has a fairly low amount 
of data, but it will still not be optimal and as fast as it should (it should 
be a near no-op as explained above by Jeff). It also obliges you to use 
"--force" option that could lead you to delete one of your nodes in another DC 
by mistake and in a loaded cluster or a 3-node cluster - RF = 3, this could 
hurt...). Having to operate using "nodetool decommission --force" cannot be 
standard, but for now I can't think of anything better for you. Maybe wait for 
someone else's confirmation, it's been a while since operated Cassandra :).

I think it would make sense to fix this somehow in Cassandra. Maybe should we 
ensure that no other keyspaces has a RF > 0 for this data center instead of 
looking at active nodes, or that there is no client connected to the nodes, add 
a manual flag somewhere, or something else? Even though I understand the 
motivation to protect users against a wrongly distributed system_auth keyspace, 
I think this protection should not be kept with this implementation. If that 
makes sense I can create a ticket for this problem.

C*heers,

Alain Rodriguez

casterix.fr<https://protect2.fireeye.com/v1/url?k=31323334-501d5122-313273af-45444731-c572154016f8885b=1=1b5f9bb8-e8af-49b9-9e2d-26622cb77bfc=http%3A%2F%2Fcasterix.fr%2F>

[Image removed by sender.]



Le lun. 8 avr. 2024 à 16:26, Michalis Kotsiouros (EXT) via user 
mailto:user@cassandra.apache.org>> a écrit :
Hello Jon and Jeff,
Thanks a lot for your replies.
I completely get your points.
Some more clarification about my issue.
When trying to update the Replication before the decommission, I get the 
following error message when I remove the replication for system_auth kesypace.
ConfigurationException: Following datacenters have active nodes and must be 
present in replication options for keyspace system_auth: [datacenter1]

This error message does not appear in the rest of the application keyspaces.
So, may I change the procedure to:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Change all keyspaces so they no longer reference the datacenter being 
removed apart from system_auth keyspace.
  4.  Run nodetool decommission using the --force option on every node in the 
datacenter being removed.
  5.  Change system_auth keyspace so they no longer reference the datacenter 
being removed.
BR
MK



From: Jeff Jirsa mailto:jji...@gmail.com>>
Sent: April 08, 2024 17:19
To: cassandra mailto:user@cassandra.apache.org>>
Cc: Michalis Kotsiouros (EXT) 
mailto:michalis.kotsiouros@ericsson.com>>
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

To Jon’s point, if you remove from replication after step 1 or step 2 (probably 
step 2 if your goal is to be strictly correct), the nodet

RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-23 Thread Michalis Kotsiouros (EXT) via user

Hello Sebastien,
Yes, your approach is really interesting. I will test this in my system as
well. I think it reduces some risks involved in the procedure that was
discussed in the previous emails.
Just for the record, availability is a top priority for my use cases that is
why I have switched the default consistency level for
authentication/authorization to LOCAL_ONE as it used in previous C*
versions.

BR
MK
-Original Message-
From: Sebastian Marsching  
Sent: April 22, 2024 21:58
To: Michalis Kotsiouros (EXT) via user 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

Recently, I successfully used the following procedure when decommissioning a
datacenter:

1. Reduced the replication factor for this DC to zero for all keyspaces
except the system_auth keyspace. For that keyspace, I reduced the RF to one.
2. Decommissioned all nodes except one in the DC using the regular procedure
(no --force needed).
3. Decommissioned the last node using --force.
4. Set the RF for the system_auth keyspace to 0.

This procedure has two benefits:

1. Authentication on the nodes in the DC being decommissioned will work
until the last node has been decommissioned. This is important when
authentication is enabled for JMX. Otherwise, you cannot proceed when there
are too few nodes left to get a LOCAL_QUORUM on system_auth.
2. One does not have to use --force except when removing the last node.

It would be nice if the RF for the system_auth keyspace could be reduced to
zero before decommissioning the nodes. However, I think that implementing
this correctly may be hard. If there are no local replicas, queries with a
consistency level of LOCAL_QUORUM will probably fail, and this is the
consistency level used for all authentication and authorization related
queries. So, setting the RF to zero might break authentication and
authorization, which in turn might make it impossible to decommission the
nodes (without disabling authentication for that DC).

So, I guess that the code dealing with authentication and authorization
would have to be changed to use a CL of QUORUM instead of LOCAL_QUORUM when
system_auth is not replicated in the local DC.



smime.p7s
Description: S/MIME cryptographic signature

Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-22 Thread Sebastian Marsching


Recently, I successfully used the following procedure when decommissioning a 
datacenter:

1. Reduced the replication factor for this DC to zero for all keyspaces except 
the system_auth keyspace. For that keyspace, I reduced the RF to one.
2. Decommissioned all nodes except one in the DC using the regular procedure 
(no --force needed).
3. Decommissioned the last node using --force.
4. Set the RF for the system_auth keyspace to 0.

This procedure has two benefits:

1. Authentication on the nodes in the DC being decommissioned will work until 
the last node has been decommissioned. This is important when authentication is 
enabled for JMX. Otherwise, you cannot proceed when there are too few nodes 
left to get a LOCAL_QUORUM on system_auth.
2. One does not have to use --force except when removing the last node.

It would be nice if the RF for the system_auth keyspace could be reduced to 
zero before decommissioning the nodes. However, I think that implementing this 
correctly may be hard. If there are no local replicas, queries with a 
consistency level of LOCAL_QUORUM will probably fail, and this is the 
consistency level used for all authentication and authorization related 
queries. So, setting the RF to zero might break authentication and 
authorization, which in turn might make it impossible to decommission the nodes 
(without disabling authentication for that DC).

So, I guess that the code dealing with authentication and authorization would 
have to be changed to use a CL of QUORUM instead of LOCAL_QUORUM when 
system_auth is not replicated in the local DC.



smime.p7s
Description: S/MIME cryptographic signature

Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-22 Thread Alain Rodriguez via user

Hi Michalis,

It's been a while since I removed a DC for the last time, but I see there
is now a protection to avoid accidentally leaving a DC without auth
capability.

This was introduced in C* 4.1 through CASSANDRA-17478 (
https://issues.apache.org/jira/browse/CASSANDRA-17478).

The process of dropping a data center might have been overlooked while
doing this work.

It's never correct for an operator to remove a DC from system_auth
> replication settings while there are currently nodes up in that DC.
>

I believe this assertion is not correct. As Jon and Jeff mentioned, usually
we remove the replication *before* decommissioning any node in the case of
removing an entire DC, for reasons exposed by Jeff. The existing
documentation is also clear about this:
https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsDecomissionDC.html
and https://thelastpickle.com/blog/2019/02/26/data-center-switch.html.

Michalis, the solution you suggest seems to be the (good/only?) way to go,
even though it looks like a workaround, not really "clean" and something we
need to improve. It was also mentioned here:
https://dba.stackexchange.com/questions/331732/not-a-able-to-decommission-the-old-datacenter#answer-334890.
It should work quickly, but only because this keyspace has a fairly low
amount of data, but it will still not be optimal and as fast as it should
(it should be a near no-op as explained above by Jeff). It also obliges you
to use "--force" option that could lead you to delete one of your nodes in
another DC by mistake and in a loaded cluster or a 3-node cluster - RF = 3,
this could hurt...). Having to operate using "nodetool decommission
--force" cannot be standard, but for now I can't think of anything better
for you. Maybe wait for someone else's confirmation, it's been a while
since operated Cassandra :).

I think it would make sense to fix this somehow in Cassandra. Maybe should
we ensure that no other keyspaces has a RF > 0 for this data center instead
of looking at active nodes, or that there is no client connected to the
nodes, add a manual flag somewhere, or something else? Even though I
understand the motivation to protect users against a wrongly distributed
system_auth keyspace, I think this protection should not be kept with this
implementation. If that makes sense I can create a ticket for this problem.

C*heers,

*Alain Rodriguezcasterix.fr <http://casterix.fr>*

Le lun. 8 avr. 2024 à 16:26, Michalis Kotsiouros (EXT) via user <
user@cassandra.apache.org> a écrit :

> Hello Jon and Jeff,
>
> Thanks a lot for your replies.
>
> I completely get your points.
>
> Some more clarification about my issue.
>
> When trying to update the Replication before the decommission, I get the
> following error message when I remove the replication for system_auth
> kesypace.
>
> ConfigurationException: Following datacenters have active nodes and must
> be present in replication options for keyspace system_auth: [datacenter1]
>
>
>
> This error message does not appear in the rest of the application
> keyspaces.
>
> So, may I change the procedure to:
>
>1. Make sure no clients are still writing to any nodes in the
>datacenter.
>2. Run a full repair with nodetool repair.
>3. Change all keyspaces so they no longer reference the datacenter
>being removed apart from system_auth keyspace.
>4. Run nodetool decommission using the --force option on every node in
>the datacenter being removed.
>5. Change system_auth keyspace so they no longer reference the
>datacenter being removed.
>
> BR
>
> MK
>
>
>
>
>
>
>
> *From:* Jeff Jirsa 
> *Sent:* April 08, 2024 17:19
> *To:* cassandra 
> *Cc:* Michalis Kotsiouros (EXT) 
> *Subject:* Re: Datacenter decommissioning on Cassandra 4.1.4
>
>
>
> To Jon’s point, if you remove from replication after step 1 or step 2
> (probably step 2 if your goal is to be strictly correct), the nodetool
> decommission phase becomes almost a no-op.
>
>
>
> If you use the order below, the last nodes to decommission will cause
> those surviving machines to run out of space (assuming you have more than a
> few nodes to start)
>
>
>
>
>
>
>
> On Apr 8, 2024, at 6:58 AM, Jon Haddad  wrote:
>
>
>
> You shouldn’t decom an entire DC before removing it from replication.
>
>
> —
>
>
> Jon Haddad
> Rustyrazorblade Consulting
> rustyrazorblade.com
> <https://protect2.fireeye.com/v1/url?k=31323334-501d5122-313273af-45444731-1624a77accb6d839=1=8a954d2d-17da-40df-8732-bdcc7893179a=http%3A%2F%2Frustyrazorblade.com%2F>
>
>
>
>
>
> On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user <
> user@cassandra.apache.org> wrote:
>
> Hello

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Tolbert, Andy

I think in the context of what I think initially motivated this hot
reloading capability, a big win it provides is avoiding having to
bounce your cluster as your certificates near expiry.  If not watched
closely you can get yourself into a state where every node in the
cluster's cert expired, which is effectively an outage.

I see the appeal of draining connections on a change of trust,
although the necessity of being able to "do it live" (as opposed to
doing a bounce) seems less important then avoiding the outage
condition of your certificates expiring, especially since you can sort
of already do this without bouncing by toggling nodetool
disablebinary/enablebinary.  I agree with Dinesh that most operators
would prefer that it does not do that as interrupting connections can
be disruptive to applications if they don't have retries configured,
but I also agree it'd be a nice improvement to support draining
existing connections in some way.

+1 on the idea of having a "timed connection" capability brought up
here, and implementing it in a way such that connection lifetimes can
be dynamically adjusted.  This way it can be made such that on a trust
store change Cassandra could simply adjust the connection lifetimes
and they will be disconnected immediately or drained over a time
period like Josh proposed.

Thanks,
Andy

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-18 Thread Josh McKenzie

I think it's all part of the same issue and you're not derailing IMO Abe. For 
the user Pabbireddy here, the unexpected behavior was not closing internode 
connections on that keystore refresh. So ISTM, from a "featureset that would be 
nice to have here" perspective, we could theoretically provide:
 1. An option to disconnect all connections on cert update, disabled by default
 2. An option to drain and recycle connections on a time period, also disabled 
by default
Leave the current behavior in place but allow for these kind of strong 
cert-guarantees if a user needs it in their env.

On Mon, Apr 15, 2024, at 9:51 PM, Abe Ratnofsky wrote:
> Not to derail from the original conversation too far, but wanted to agree 
> that maximum connection establishment time on native transport would be 
> useful. That would provide a maximum duration before an updated client 
> keystore is used for connections, which can be used to safely roll out client 
> keystore updates.
> 
> For example, if the maximum connection establishment time is 12 hours, then 
> you can update the keystore on a canary client, wait 24 hours, confirm that 
> connectivity is maintained, then upgrade keystores across the rest of the 
> fleet.
> 
> With unbounded connection establishment, reconnection isn't tested as often 
> and issues can hide behind long-lived connections.
> 
>> On Apr 15, 2024, at 5:14 PM, Jeff Jirsa  wrote:
>> 
>> It seems like if folks really want the life of a connection to be finite 
>> (either client/server or server/server), adding in an option to quietly 
>> drain and recycle a connection on some period isn’t that difficult.
>> 
>> That type of requirement shows up in a number of environments, usually on 
>> interactive logins (cqlsh, login, walk away, the connection needs to become 
>> invalid in a short and finite period of time), but adding it to internode 
>> could also be done, and may help in some weird situations (if you changed 
>> certs because you believe a key/cert is compromised, having the connection 
>> remain active is decidedly inconvenient, so maybe it does make sense to add 
>> an expiration timer/condition on each connection).
>> 
>> 
>> 
>>> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
>>> 
>>> In addition to what Andy mentioned, I want to point out that for the vast 
>>> majority of use-cases, we would like to _avoid_ interruptions when a 
>>> certificate is updated so it is by design. If you're dealing with a 
>>> situation where you want to ensure that the connections are cycled, you can 
>>> follow Andy's advice. It will require automation outside the database that 
>>> you might already have. If there is demand, we can consider adding a 
>>> feature to slowly cycle the connections so the old SSL context is not used 
>>> anymore.
>>> 
>>> One more thing you should bear in mind is that Cassandra will not load the 
>>> new SSL context if it cannot successfully initialize it. This is again by 
>>> design to prevent an outage when the updated truststore is corrupted or 
>>> could not be read in some way.
>>> 
>>> thanks,
>>> Dinesh
>>> 
>>> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy  
>>> wrote:
 I should mention, when toggling disablebinary/enablebinary between
 instances, you will probably want to give some time between doing this
 so connections can reestablish, and you will want to verify that the
 connections can actually reestablish.  You also need to be mindful of
 this being disruptive to inflight queries (if your client is
 configured for retries it will probably be fine).  Semantically to
 your applications it should look a lot like a rolling cluster bounce.
 
 Thanks,
 Andy
 
 On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
  wrote:
 >
 > Thanks Andy for your reply . We will test the scenario you mentioned.
 >
 > Regards
 > Avinash
 >
 > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  
 > wrote:
 >>
 >> Hi Avinash,
 >>
 >> As far as I understand it, if the underlying keystore/trustore(s)
 >> Cassandra is configured for is updated, this *will not* provoke
 >> Cassandra to interrupt existing connections, it's just that the new
 >> stores will be used for future TLS initialization.
 >>
 >> Via: 
 >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
 >>
 >> > When the files are updated, Cassandra will reload them and use them 
 >> > for subsequent connections
 >>
 >> I suppose one could do a rolling disablebinary/enablebinary (if it's
 >> only client connections) after you roll out a keystore/truststore
 >> change as a way of enforcing the existing connections to reestablish.
 >>
 >> Thanks,
 >> Andy
 >>
 >>
 >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
 >>  wrote:
 >> >
 >> > Dear Community,
 >> >
 >> > I hope this email

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Abe Ratnofsky

Not to derail from the original conversation too far, but wanted to agree that 
maximum connection establishment time on native transport would be useful. That 
would provide a maximum duration before an updated client keystore is used for 
connections, which can be used to safely roll out client keystore updates.

For example, if the maximum connection establishment time is 12 hours, then you 
can update the keystore on a canary client, wait 24 hours, confirm that 
connectivity is maintained, then upgrade keystores across the rest of the fleet.

With unbounded connection establishment, reconnection isn't tested as often and 
issues can hide behind long-lived connections.

> On Apr 15, 2024, at 5:14 PM, Jeff Jirsa  wrote:
> 
> It seems like if folks really want the life of a connection to be finite 
> (either client/server or server/server), adding in an option to quietly drain 
> and recycle a connection on some period isn’t that difficult.
> 
> That type of requirement shows up in a number of environments, usually on 
> interactive logins (cqlsh, login, walk away, the connection needs to become 
> invalid in a short and finite period of time), but adding it to internode 
> could also be done, and may help in some weird situations (if you changed 
> certs because you believe a key/cert is compromised, having the connection 
> remain active is decidedly inconvenient, so maybe it does make sense to add 
> an expiration timer/condition on each connection).
> 
> 
> 
>> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
>> 
>> In addition to what Andy mentioned, I want to point out that for the vast 
>> majority of use-cases, we would like to _avoid_ interruptions when a 
>> certificate is updated so it is by design. If you're dealing with a 
>> situation where you want to ensure that the connections are cycled, you can 
>> follow Andy's advice. It will require automation outside the database that 
>> you might already have. If there is demand, we can consider adding a feature 
>> to slowly cycle the connections so the old SSL context is not used anymore.
>> 
>> One more thing you should bear in mind is that Cassandra will not load the 
>> new SSL context if it cannot successfully initialize it. This is again by 
>> design to prevent an outage when the updated truststore is corrupted or 
>> could not be read in some way.
>> 
>> thanks,
>> Dinesh
>> 
>> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy > > wrote:
>>> I should mention, when toggling disablebinary/enablebinary between
>>> instances, you will probably want to give some time between doing this
>>> so connections can reestablish, and you will want to verify that the
>>> connections can actually reestablish.  You also need to be mindful of
>>> this being disruptive to inflight queries (if your client is
>>> configured for retries it will probably be fine).  Semantically to
>>> your applications it should look a lot like a rolling cluster bounce.
>>> 
>>> Thanks,
>>> Andy
>>> 
>>> On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
>>> mailto:pabbireddyavin...@gmail.com>> wrote:
>>> >
>>> > Thanks Andy for your reply . We will test the scenario you mentioned.
>>> >
>>> > Regards
>>> > Avinash
>>> >
>>> > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy >> > > wrote:
>>> >>
>>> >> Hi Avinash,
>>> >>
>>> >> As far as I understand it, if the underlying keystore/trustore(s)
>>> >> Cassandra is configured for is updated, this *will not* provoke
>>> >> Cassandra to interrupt existing connections, it's just that the new
>>> >> stores will be used for future TLS initialization.
>>> >>
>>> >> Via: 
>>> >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>>> >>
>>> >> > When the files are updated, Cassandra will reload them and use them 
>>> >> > for subsequent connections
>>> >>
>>> >> I suppose one could do a rolling disablebinary/enablebinary (if it's
>>> >> only client connections) after you roll out a keystore/truststore
>>> >> change as a way of enforcing the existing connections to reestablish.
>>> >>
>>> >> Thanks,
>>> >> Andy
>>> >>
>>> >>
>>> >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>>> >> mailto:pabbireddyavin...@gmail.com>> wrote:
>>> >> >
>>> >> > Dear Community,
>>> >> >
>>> >> > I hope this email finds you well. I am currently testing SSL 
>>> >> > certificate hot reloading on a Cassandra cluster running version 4.1 
>>> >> > and encountered a situation that requires your expertise.
>>> >> >
>>> >> > Here's a summary of the process and issue:
>>> >> >
>>> >> > Reloading Process: We reloaded certificates signed by our in-house 
>>> >> > certificate authority into the cluster, which was initially running 
>>> >> > with self-signed certificates. The reload was done node by node.
>>> >> >
>>> >> > Truststore and Keystore: The truststore and keystore passwords are the 
>>> >> > same across the cluster.
>>> >> >
>>> >> >

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Jeff Jirsa

It seems like if folks really want the life of a connection to be finite 
(either client/server or server/server), adding in an option to quietly drain 
and recycle a connection on some period isn’t that difficult.

That type of requirement shows up in a number of environments, usually on 
interactive logins (cqlsh, login, walk away, the connection needs to become 
invalid in a short and finite period of time), but adding it to internode could 
also be done, and may help in some weird situations (if you changed certs 
because you believe a key/cert is compromised, having the connection remain 
active is decidedly inconvenient, so maybe it does make sense to add an 
expiration timer/condition on each connection).



> On Apr 15, 2024, at 12:28 PM, Dinesh Joshi  wrote:
> 
> In addition to what Andy mentioned, I want to point out that for the vast 
> majority of use-cases, we would like to _avoid_ interruptions when a 
> certificate is updated so it is by design. If you're dealing with a situation 
> where you want to ensure that the connections are cycled, you can follow 
> Andy's advice. It will require automation outside the database that you might 
> already have. If there is demand, we can consider adding a feature to slowly 
> cycle the connections so the old SSL context is not used anymore.
> 
> One more thing you should bear in mind is that Cassandra will not load the 
> new SSL context if it cannot successfully initialize it. This is again by 
> design to prevent an outage when the updated truststore is corrupted or could 
> not be read in some way.
> 
> thanks,
> Dinesh
> 
> On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy  > wrote:
>> I should mention, when toggling disablebinary/enablebinary between
>> instances, you will probably want to give some time between doing this
>> so connections can reestablish, and you will want to verify that the
>> connections can actually reestablish.  You also need to be mindful of
>> this being disruptive to inflight queries (if your client is
>> configured for retries it will probably be fine).  Semantically to
>> your applications it should look a lot like a rolling cluster bounce.
>> 
>> Thanks,
>> Andy
>> 
>> On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
>> mailto:pabbireddyavin...@gmail.com>> wrote:
>> >
>> > Thanks Andy for your reply . We will test the scenario you mentioned.
>> >
>> > Regards
>> > Avinash
>> >
>> > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy > > > wrote:
>> >>
>> >> Hi Avinash,
>> >>
>> >> As far as I understand it, if the underlying keystore/trustore(s)
>> >> Cassandra is configured for is updated, this *will not* provoke
>> >> Cassandra to interrupt existing connections, it's just that the new
>> >> stores will be used for future TLS initialization.
>> >>
>> >> Via: 
>> >> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>> >>
>> >> > When the files are updated, Cassandra will reload them and use them for 
>> >> > subsequent connections
>> >>
>> >> I suppose one could do a rolling disablebinary/enablebinary (if it's
>> >> only client connections) after you roll out a keystore/truststore
>> >> change as a way of enforcing the existing connections to reestablish.
>> >>
>> >> Thanks,
>> >> Andy
>> >>
>> >>
>> >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>> >> mailto:pabbireddyavin...@gmail.com>> wrote:
>> >> >
>> >> > Dear Community,
>> >> >
>> >> > I hope this email finds you well. I am currently testing SSL 
>> >> > certificate hot reloading on a Cassandra cluster running version 4.1 
>> >> > and encountered a situation that requires your expertise.
>> >> >
>> >> > Here's a summary of the process and issue:
>> >> >
>> >> > Reloading Process: We reloaded certificates signed by our in-house 
>> >> > certificate authority into the cluster, which was initially running 
>> >> > with self-signed certificates. The reload was done node by node.
>> >> >
>> >> > Truststore and Keystore: The truststore and keystore passwords are the 
>> >> > same across the cluster.
>> >> >
>> >> > Unexpected Behavior: Despite the different truststore configurations 
>> >> > for the self-signed and new CA certificates, we observed no breakdown 
>> >> > in server-to-server communication during the reload. We did not upload 
>> >> > the new CA cert into the old truststore.We anticipated interruptions 
>> >> > due to the differing truststore configurations but did not encounter 
>> >> > any.
>> >> >
>> >> > Post-Reload Changes: After reloading, we updated the cqlshrc file with 
>> >> > the new CA certificate and key to connect to cqlsh.
>> >> >
>> >> > server_encryption_options:
>> >> >
>> >> > internode_encryption: all
>> >> >
>> >> > keystore: ~/conf/server-keystore.jks
>> >> >
>> >> > keystore_password: 
>> >> >
>> >> > truststore: ~/conf/server-truststore.jks
>> >> >
>> >> > truststore_password: 
>> >> >
>>

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Dinesh Joshi

In addition to what Andy mentioned, I want to point out that for the vast
majority of use-cases, we would like to _avoid_ interruptions when a
certificate is updated so it is by design. If you're dealing with a
situation where you want to ensure that the connections are cycled, you can
follow Andy's advice. It will require automation outside the database that
you might already have. If there is demand, we can consider adding a
feature to slowly cycle the connections so the old SSL context is not used
anymore.

One more thing you should bear in mind is that Cassandra will not load the
new SSL context if it cannot successfully initialize it. This is again by
design to prevent an outage when the updated truststore is corrupted or
could not be read in some way.

thanks,
Dinesh

On Mon, Apr 15, 2024 at 9:45 AM Tolbert, Andy  wrote:

> I should mention, when toggling disablebinary/enablebinary between
> instances, you will probably want to give some time between doing this
> so connections can reestablish, and you will want to verify that the
> connections can actually reestablish.  You also need to be mindful of
> this being disruptive to inflight queries (if your client is
> configured for retries it will probably be fine).  Semantically to
> your applications it should look a lot like a rolling cluster bounce.
>
> Thanks,
> Andy
>
> On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
>  wrote:
> >
> > Thanks Andy for your reply . We will test the scenario you mentioned.
> >
> > Regards
> > Avinash
> >
> > On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy 
> wrote:
> >>
> >> Hi Avinash,
> >>
> >> As far as I understand it, if the underlying keystore/trustore(s)
> >> Cassandra is configured for is updated, this *will not* provoke
> >> Cassandra to interrupt existing connections, it's just that the new
> >> stores will be used for future TLS initialization.
> >>
> >> Via:
> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
> >>
> >> > When the files are updated, Cassandra will reload them and use them
> for subsequent connections
> >>
> >> I suppose one could do a rolling disablebinary/enablebinary (if it's
> >> only client connections) after you roll out a keystore/truststore
> >> change as a way of enforcing the existing connections to reestablish.
> >>
> >> Thanks,
> >> Andy
> >>
> >>
> >> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
> >>  wrote:
> >> >
> >> > Dear Community,
> >> >
> >> > I hope this email finds you well. I am currently testing SSL
> certificate hot reloading on a Cassandra cluster running version 4.1 and
> encountered a situation that requires your expertise.
> >> >
> >> > Here's a summary of the process and issue:
> >> >
> >> > Reloading Process: We reloaded certificates signed by our in-house
> certificate authority into the cluster, which was initially running with
> self-signed certificates. The reload was done node by node.
> >> >
> >> > Truststore and Keystore: The truststore and keystore passwords are
> the same across the cluster.
> >> >
> >> > Unexpected Behavior: Despite the different truststore configurations
> for the self-signed and new CA certificates, we observed no breakdown in
> server-to-server communication during the reload. We did not upload the new
> CA cert into the old truststore.We anticipated interruptions due to the
> differing truststore configurations but did not encounter any.
> >> >
> >> > Post-Reload Changes: After reloading, we updated the cqlshrc file
> with the new CA certificate and key to connect to cqlsh.
> >> >
> >> > server_encryption_options:
> >> >
> >> > internode_encryption: all
> >> >
> >> > keystore: ~/conf/server-keystore.jks
> >> >
> >> > keystore_password: 
> >> >
> >> > truststore: ~/conf/server-truststore.jks
> >> >
> >> > truststore_password: 
> >> >
> >> > protocol: TLS
> >> >
> >> > algorithm: SunX509
> >> >
> >> > store_type: JKS
> >> >
> >> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >> >
> >> > require_client_auth: true
> >> >
> >> > client_encryption_options:
> >> >
> >> > enabled: true
> >> >
> >> > keystore: ~/conf/server-keystore.jks
> >> >
> >> > keystore_password: 
> >> >
> >> > require_client_auth: true
> >> >
> >> > truststore: ~/conf/server-truststore.jks
> >> >
> >> > truststore_password: 
> >> >
> >> > protocol: TLS
> >> >
> >> > algorithm: SunX509
> >> >
> >> > store_type: JKS
> >> >
> >> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >> >
> >> > Given this situation, I have the following questions:
> >> >
> >> > Could there be a reason for the continuity of server-to-server
> communication despite the different truststores?
> >> > Is there a possibility that the old truststore remains cached even
> after reloading the certificates on a node?
> >> > Have others encountered similar issues, and if so, what were your
> solutions?
> >> >
> >> > Any insights or

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Tolbert, Andy

I should mention, when toggling disablebinary/enablebinary between
instances, you will probably want to give some time between doing this
so connections can reestablish, and you will want to verify that the
connections can actually reestablish.  You also need to be mindful of
this being disruptive to inflight queries (if your client is
configured for retries it will probably be fine).  Semantically to
your applications it should look a lot like a rolling cluster bounce.

Thanks,
Andy

On Mon, Apr 15, 2024 at 11:39 AM pabbireddy avinash
 wrote:
>
> Thanks Andy for your reply . We will test the scenario you mentioned.
>
> Regards
> Avinash
>
> On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  
> wrote:
>>
>> Hi Avinash,
>>
>> As far as I understand it, if the underlying keystore/trustore(s)
>> Cassandra is configured for is updated, this *will not* provoke
>> Cassandra to interrupt existing connections, it's just that the new
>> stores will be used for future TLS initialization.
>>
>> Via: 
>> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>>
>> > When the files are updated, Cassandra will reload them and use them for 
>> > subsequent connections
>>
>> I suppose one could do a rolling disablebinary/enablebinary (if it's
>> only client connections) after you roll out a keystore/truststore
>> change as a way of enforcing the existing connections to reestablish.
>>
>> Thanks,
>> Andy
>>
>>
>> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>>  wrote:
>> >
>> > Dear Community,
>> >
>> > I hope this email finds you well. I am currently testing SSL certificate 
>> > hot reloading on a Cassandra cluster running version 4.1 and encountered a 
>> > situation that requires your expertise.
>> >
>> > Here's a summary of the process and issue:
>> >
>> > Reloading Process: We reloaded certificates signed by our in-house 
>> > certificate authority into the cluster, which was initially running with 
>> > self-signed certificates. The reload was done node by node.
>> >
>> > Truststore and Keystore: The truststore and keystore passwords are the 
>> > same across the cluster.
>> >
>> > Unexpected Behavior: Despite the different truststore configurations for 
>> > the self-signed and new CA certificates, we observed no breakdown in 
>> > server-to-server communication during the reload. We did not upload the 
>> > new CA cert into the old truststore.We anticipated interruptions due to 
>> > the differing truststore configurations but did not encounter any.
>> >
>> > Post-Reload Changes: After reloading, we updated the cqlshrc file with the 
>> > new CA certificate and key to connect to cqlsh.
>> >
>> > server_encryption_options:
>> >
>> > internode_encryption: all
>> >
>> > keystore: ~/conf/server-keystore.jks
>> >
>> > keystore_password: 
>> >
>> > truststore: ~/conf/server-truststore.jks
>> >
>> > truststore_password: 
>> >
>> > protocol: TLS
>> >
>> > algorithm: SunX509
>> >
>> > store_type: JKS
>> >
>> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>> >
>> > require_client_auth: true
>> >
>> > client_encryption_options:
>> >
>> > enabled: true
>> >
>> > keystore: ~/conf/server-keystore.jks
>> >
>> > keystore_password: 
>> >
>> > require_client_auth: true
>> >
>> > truststore: ~/conf/server-truststore.jks
>> >
>> > truststore_password: 
>> >
>> > protocol: TLS
>> >
>> > algorithm: SunX509
>> >
>> > store_type: JKS
>> >
>> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>> >
>> > Given this situation, I have the following questions:
>> >
>> > Could there be a reason for the continuity of server-to-server 
>> > communication despite the different truststores?
>> > Is there a possibility that the old truststore remains cached even after 
>> > reloading the certificates on a node?
>> > Have others encountered similar issues, and if so, what were your 
>> > solutions?
>> >
>> > Any insights or suggestions would be greatly appreciated. Please let me 
>> > know if further information is needed.
>> >
>> > Thank you
>> >
>> > Best regards,
>> >
>> > Avinash

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread pabbireddy avinash

Thanks Andy for your reply . We will test the scenario you mentioned.

Regards
Avinash

On Mon, Apr 15, 2024 at 11:28 AM, Tolbert, Andy  wrote:

> Hi Avinash,
>
> As far as I understand it, if the underlying keystore/trustore(s)
> Cassandra is configured for is updated, this *will not* provoke
> Cassandra to interrupt existing connections, it's just that the new
> stores will be used for future TLS initialization.
>
> Via:
> https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading
>
> > When the files are updated, Cassandra will reload them and use them for
> subsequent connections
>
> I suppose one could do a rolling disablebinary/enablebinary (if it's
> only client connections) after you roll out a keystore/truststore
> change as a way of enforcing the existing connections to reestablish.
>
> Thanks,
> Andy
>
>
> On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
>  wrote:
> >
> > Dear Community,
> >
> > I hope this email finds you well. I am currently testing SSL certificate
> hot reloading on a Cassandra cluster running version 4.1 and encountered a
> situation that requires your expertise.
> >
> > Here's a summary of the process and issue:
> >
> > Reloading Process: We reloaded certificates signed by our in-house
> certificate authority into the cluster, which was initially running with
> self-signed certificates. The reload was done node by node.
> >
> > Truststore and Keystore: The truststore and keystore passwords are the
> same across the cluster.
> >
> > Unexpected Behavior: Despite the different truststore configurations for
> the self-signed and new CA certificates, we observed no breakdown in
> server-to-server communication during the reload. We did not upload the new
> CA cert into the old truststore.We anticipated interruptions due to the
> differing truststore configurations but did not encounter any.
> >
> > Post-Reload Changes: After reloading, we updated the cqlshrc file with
> the new CA certificate and key to connect to cqlsh.
> >
> > server_encryption_options:
> >
> > internode_encryption: all
> >
> > keystore: ~/conf/server-keystore.jks
> >
> > keystore_password: 
> >
> > truststore: ~/conf/server-truststore.jks
> >
> > truststore_password: 
> >
> > protocol: TLS
> >
> > algorithm: SunX509
> >
> > store_type: JKS
> >
> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >
> > require_client_auth: true
> >
> > client_encryption_options:
> >
> > enabled: true
> >
> > keystore: ~/conf/server-keystore.jks
> >
> > keystore_password: 
> >
> > require_client_auth: true
> >
> > truststore: ~/conf/server-truststore.jks
> >
> > truststore_password: 
> >
> > protocol: TLS
> >
> > algorithm: SunX509
> >
> > store_type: JKS
> >
> > cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
> >
> > Given this situation, I have the following questions:
> >
> > Could there be a reason for the continuity of server-to-server
> communication despite the different truststores?
> > Is there a possibility that the old truststore remains cached even after
> reloading the certificates on a node?
> > Have others encountered similar issues, and if so, what were your
> solutions?
> >
> > Any insights or suggestions would be greatly appreciated. Please let me
> know if further information is needed.
> >
> > Thank you
> >
> > Best regards,
> >
> > Avinash
>

Re: ssl certificate hot reloading test - cassandra 4.1

2024-04-15 Thread Tolbert, Andy

Hi Avinash,

As far as I understand it, if the underlying keystore/trustore(s)
Cassandra is configured for is updated, this *will not* provoke
Cassandra to interrupt existing connections, it's just that the new
stores will be used for future TLS initialization.

Via: 
https://cassandra.apache.org/doc/4.1/cassandra/operating/security.html#ssl-certificate-hot-reloading

> When the files are updated, Cassandra will reload them and use them for 
> subsequent connections

I suppose one could do a rolling disablebinary/enablebinary (if it's
only client connections) after you roll out a keystore/truststore
change as a way of enforcing the existing connections to reestablish.

Thanks,
Andy


On Mon, Apr 15, 2024 at 11:11 AM pabbireddy avinash
 wrote:
>
> Dear Community,
>
> I hope this email finds you well. I am currently testing SSL certificate hot 
> reloading on a Cassandra cluster running version 4.1 and encountered a 
> situation that requires your expertise.
>
> Here's a summary of the process and issue:
>
> Reloading Process: We reloaded certificates signed by our in-house 
> certificate authority into the cluster, which was initially running with 
> self-signed certificates. The reload was done node by node.
>
> Truststore and Keystore: The truststore and keystore passwords are the same 
> across the cluster.
>
> Unexpected Behavior: Despite the different truststore configurations for the 
> self-signed and new CA certificates, we observed no breakdown in 
> server-to-server communication during the reload. We did not upload the new 
> CA cert into the old truststore.We anticipated interruptions due to the 
> differing truststore configurations but did not encounter any.
>
> Post-Reload Changes: After reloading, we updated the cqlshrc file with the 
> new CA certificate and key to connect to cqlsh.
>
> server_encryption_options:
>
> internode_encryption: all
>
> keystore: ~/conf/server-keystore.jks
>
> keystore_password: 
>
> truststore: ~/conf/server-truststore.jks
>
> truststore_password: 
>
> protocol: TLS
>
> algorithm: SunX509
>
> store_type: JKS
>
> cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>
> require_client_auth: true
>
> client_encryption_options:
>
> enabled: true
>
> keystore: ~/conf/server-keystore.jks
>
> keystore_password: 
>
> require_client_auth: true
>
> truststore: ~/conf/server-truststore.jks
>
> truststore_password: 
>
> protocol: TLS
>
> algorithm: SunX509
>
> store_type: JKS
>
> cipher_suites: [TLS_RSA_WITH_AES_256_CBC_SHA]
>
> Given this situation, I have the following questions:
>
> Could there be a reason for the continuity of server-to-server communication 
> despite the different truststores?
> Is there a possibility that the old truststore remains cached even after 
> reloading the certificates on a node?
> Have others encountered similar issues, and if so, what were your solutions?
>
> Any insights or suggestions would be greatly appreciated. Please let me know 
> if further information is needed.
>
> Thank you
>
> Best regards,
>
> Avinash

RE: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Michalis Kotsiouros (EXT) via user

Hello Jon and Jeff,
Thanks a lot for your replies.
I completely get your points.
Some more clarification about my issue.
When trying to update the Replication before the decommission, I get the 
following error message when I remove the replication for system_auth kesypace.
ConfigurationException: Following datacenters have active nodes and must be 
present in replication options for keyspace system_auth: [datacenter1]

This error message does not appear in the rest of the application keyspaces.
So, may I change the procedure to:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Change all keyspaces so they no longer reference the datacenter being 
removed apart from system_auth keyspace.
  4.  Run nodetool decommission using the --force option on every node in the 
datacenter being removed.
  5.  Change system_auth keyspace so they no longer reference the datacenter 
being removed.
BR
MK



From: Jeff Jirsa 
Sent: April 08, 2024 17:19
To: cassandra 
Cc: Michalis Kotsiouros (EXT) 
Subject: Re: Datacenter decommissioning on Cassandra 4.1.4

To Jon’s point, if you remove from replication after step 1 or step 2 (probably 
step 2 if your goal is to be strictly correct), the nodetool decommission phase 
becomes almost a no-op.

If you use the order below, the last nodes to decommission will cause those 
surviving machines to run out of space (assuming you have more than a few nodes 
to start)




On Apr 8, 2024, at 6:58 AM, Jon Haddad 
mailto:j...@jonhaddad.com>> wrote:

You shouldn’t decom an entire DC before removing it from replication.

—

Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com<https://protect2.fireeye.com/v1/url?k=31323334-501d5122-313273af-45444731-1624a77accb6d839=1=8a954d2d-17da-40df-8732-bdcc7893179a=http%3A%2F%2Frustyrazorblade.com%2F>


On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user 
mailto:user@cassandra.apache.org>> wrote:
Hello community,
In our deployments, we usually rebuild the Cassandra datacenters for 
maintenance or recovery operations.
The procedure used since the days of Cassandra 3.x was the one documented in 
datastax documentation. Decommissioning a datacenter | Apache Cassandra 3.x 
(datastax.com)<https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsDecomissionDC.html>
After upgrading to Cassandra 4.1.4, we have realized that there are some 
stricter rules that do not allo to remove the replication when active Cassandra 
nodes still exist in a datacenter.
This check makes the above-mentioned procedure obsolete.
I am thinking to use the following as an alternative:

  1.  Make sure no clients are still writing to any nodes in the datacenter.
  2.  Run a full repair with nodetool repair.
  3.  Run nodetool decommission using the --force option on every node in the 
datacenter being removed.
  4.  Change all keyspaces so they no longer reference the datacenter being 
removed.

What is the procedure followed by other users? Do you see any risk following 
the proposed procedure?

BR
MK

Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Jeff Jirsa

To Jon’s point, if you remove from replication after step 1 or step 2 (probably 
step 2 if your goal is to be strictly correct), the nodetool decommission phase 
becomes almost a no-op. 

If you use the order below, the last nodes to decommission will cause those 
surviving machines to run out of space (assuming you have more than a few nodes 
to start)



> On Apr 8, 2024, at 6:58 AM, Jon Haddad  wrote:
> 
> You shouldn’t decom an entire DC before removing it from replication.
> 
> —
> 
> Jon Haddad
> Rustyrazorblade Consulting
> rustyrazorblade.com 
> 
> 
> On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user 
> mailto:user@cassandra.apache.org>> wrote:
>> Hello community,
>> 
>> In our deployments, we usually rebuild the Cassandra datacenters for 
>> maintenance or recovery operations.
>> 
>> The procedure used since the days of Cassandra 3.x was the one documented in 
>> datastax documentation. Decommissioning a datacenter | Apache Cassandra 3.x 
>> (datastax.com) 
>> 
>> After upgrading to Cassandra 4.1.4, we have realized that there are some 
>> stricter rules that do not allo to remove the replication when active 
>> Cassandra nodes still exist in a datacenter.
>> 
>> This check makes the above-mentioned procedure obsolete.
>> 
>> I am thinking to use the following as an alternative:
>> 
>> Make sure no clients are still writing to any nodes in the datacenter.
>> Run a full repair with nodetool repair.
>> Run nodetool decommission using the --force option on every node in the 
>> datacenter being removed.
>> Change all keyspaces so they no longer reference the datacenter being 
>> removed.
>>  
>> 
>> What is the procedure followed by other users? Do you see any risk following 
>> the proposed procedure?
>> 
>>  
>> 
>> BR
>> 
>> MK
>>

Re: Datacenter decommissioning on Cassandra 4.1.4

2024-04-08 Thread Jon Haddad

You shouldn’t decom an entire DC before removing it from replication.

—

Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Mon, Apr 8, 2024 at 6:26 AM Michalis Kotsiouros (EXT) via user <
user@cassandra.apache.org> wrote:

> Hello community,
>
> In our deployments, we usually rebuild the Cassandra datacenters for
> maintenance or recovery operations.
>
> The procedure used since the days of Cassandra 3.x was the one documented
> in datastax documentation. Decommissioning a datacenter | Apache
> Cassandra 3.x (datastax.com)
> 
>
> After upgrading to Cassandra 4.1.4, we have realized that there are some
> stricter rules that do not allo to remove the replication when active
> Cassandra nodes still exist in a datacenter.
>
> This check makes the above-mentioned procedure obsolete.
>
> I am thinking to use the following as an alternative:
>
>1. Make sure no clients are still writing to any nodes in the
>datacenter.
>2. Run a full repair with nodetool repair.
>3. Run nodetool decommission using the --force option on every node in
>the datacenter being removed.
>4. Change all keyspaces so they no longer reference the datacenter
>being removed.
>
>
>
> What is the procedure followed by other users? Do you see any risk
> following the proposed procedure?
>
>
>
> BR
>
> MK
>

Re: Update: C/C NA Call for Presentations Deadline Extended to April 15th

2024-04-06 Thread Paulo Motta

Hi,

I would like to send a friendly reminder that the Community Over Code North
America 2024 call for presentations ends in a little less than 9 days on
Mon, 15 April 2024 22:59:59 UTC. Don't leave your Cassandra submissions to
the last minute! :-)

Thanks,

Paulo

On Tue, Mar 19, 2024 at 7:19 PM Paulo Motta  wrote:

> Hi,
>
> I wanted to update that the Call for Presentations deadline was extended
> by two weeks to April 15th, 2024 for Community Over Code North America
> 2024. Find more information on this blog post:
> https://news.apache.org/foundation/entry/apache-software-foundation-opens-cfp-for-community-over-code-north-america-2024
>
> We're looking for presentation abstracts in the following areas:
> * Customizing and tweaking Cassandra
> * Benchmarking and testing Cassandra
> * New Cassandra features and improvements
> * Provisioning and operating Cassandra
> * Developing with Cassandra
> * Anything else related to Apache Cassandra
>
> Please use this link to submit your proposal:
> https://sessionize.com/community-over-code-na-2024/
>
> Thanks,
>
> Paulo
>

Re: Query on Performance Dip

2024-04-05 Thread Jon Haddad

Try changing the chunk length parameter on the compression settings to 4kb,
and reduce read ahead to 16kb if you’re using EBS or 4KB if you’re using
decent local ssd or nvme.

Counters read before write.

—
Jon Haddad
Rustyrazorblade Consulting
rustyrazorblade.com


On Fri, Apr 5, 2024 at 9:27 AM Subroto Barua  wrote:

> follow up question on performance issue with 'counter writes'- is there a
> parameter or condition that limits the allocation rate for
> 'CounterMutationStage'? I see 13-18mb/s for 4.1.4 Vs 20-25mb/s for 4.0.5.
>
> The back-end infra is same for both the clusters and same test cases/data
> model.
> On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad <
> j...@jonhaddad.com> wrote:
>
>
> Hi,
>
> Unfortunately, the numbers you're posting have no meaning without
> context.  The speculative retries could be the cause of a problem, or you
> could simply be executing enough queries and you have a fairly high
> variance in latency which triggers them often.  It's unclear how many
> queries / second you're executing and there's no historical information to
> suggest if what you're seeing now is an anomaly or business as usual.
>
> If you want to determine if your theory that speculative retries are
> causing your performance issue, then you could try changing speculative
> retry to a fixed value instead of a percentile, such as 50MS.  It's easy
> enough to try and you can get an answer to your question almost immediately.
>
> The problem with this is that you're essentially guessing based on very
> limited information - the output of a nodetool command you've run "every
> few secs".  I prefer to use a more data driven approach.  Get a CPU flame
> graph and figure out where your time is spent:
> https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
>
> The flame graph will reveal where your time is spent, and you can focus on
> improving that, rather than looking at a random statistic that you've
> picked.
>
> I just gave a talk at SCALE on distributed systems performance
> troubleshooting.  You'll be better off following a methodical process than
> guessing at potential root causes, because the odds of you correctly
> guessing the root cause in a system this complex is close to zero.  My talk
> is here: https://www.youtube.com/watch?v=VX9tHk3VTLE
>
> I'm guessing you don't have dashboards in place if you're relying on
> nodetool output with grep.  If your cluster is under 6 nodes, you can take
> advantage of AxonOps's free tier: https://axonops.com/
>
> Good dashboards are essential for these types of problems.
>
> Jon
>
>
>
> On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:
>
> Hi All,
>
> On debugging the cluster for performance dip seen while using 4.1.4,  i
> found high speculation retries Value in nodetool tablestats during read
> operation.
>
> I ran the below tablestats command and checked its output after every few
> secs and noticed that retries are on rising side. Also there is one open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> Suspecting this could be performance dip cause.  Please add in case anyone
> knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
> we are seeing similar perf issues with counter writes - to reproduce:
>
> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user= password= -name 
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>
>
> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
> goel.ra...@gmail.com> wrote:
>
>
> Hi All,
>
> Was going through this mail chain
> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>  and was wondering that if this could cause a performance degradation in
> 4.1 without changing compactionThroughput.
>
> As seeing performance dip in Read/Write after upgrading

Re: Query on Performance Dip

2024-04-05 Thread Subroto Barua via user

 follow up question on performance issue with 'counter writes'- is there a 
parameter or condition that limits the allocation rate for 
'CounterMutationStage'? I see 13-18mb/s for 4.1.4 Vs 20-25mb/s for 4.0.5.

The back-end infra is same for both the clusters and same test cases/data model.
On Saturday, March 30, 2024 at 08:40:28 AM PDT, Jon Haddad 
 wrote:  
 
 Hi,

Unfortunately, the numbers you're posting have no meaning without context.  The 
speculative retries could be the cause of a problem, or you could simply be 
executing enough queries and you have a fairly high variance in latency which 
triggers them often.  It's unclear how many queries / second you're executing 
and there's no historical information to suggest if what you're seeing now is 
an anomaly or business as usual.
If you want to determine if your theory that speculative retries are causing 
your performance issue, then you could try changing speculative retry to a 
fixed value instead of a percentile, such as 50MS.  It's easy enough to try and 
you can get an answer to your question almost immediately.
The problem with this is that you're essentially guessing based on very limited 
information - the output of a nodetool command you've run "every few secs".  I 
prefer to use a more data driven approach.  Get a CPU flame graph and figure 
out where your time is spent: 
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/
The flame graph will reveal where your time is spent, and you can focus on 
improving that, rather than looking at a random statistic that you've picked.
I just gave a talk at SCALE on distributed systems performance troubleshooting. 
 You'll be better off following a methodical process than guessing at potential 
root causes, because the odds of you correctly guessing the root cause in a 
system this complex is close to zero.  My talk is here: 
https://www.youtube.com/watch?v=VX9tHk3VTLE
I'm guessing you don't have dashboards in place if you're relying on nodetool 
output with grep.  If your cluster is under 6 nodes, you can take advantage of 
AxonOps's free tier: https://axonops.com/
Good dashboards are essential for these types of problems.    
Jon


On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:

Hi All,
On debugging the cluster for performance dip seen while using 4.1.4,  i found 
high speculation retries Value in nodetool tablestats during read operation.
I ran the below tablestats command and checked its output after every few secs 
and noticed that retries are on rising side. Also there is one open ticket 
(https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to 
this./usr/share/cassandra/bin/nodetool -u  -pw  -p  
tablestats  | grep -i 'Speculative retries' 

                    

    Speculative retries: 11633

                ..

                ..

                Speculative retries: 13727

     

    Speculative retries: 14256

    Speculative retries: 14855

    Speculative retries: 14858

    Speculative retries: 14859

    Speculative retries: 14873

    Speculative retries: 14875

    Speculative retries: 14890

    Speculative retries: 14893

    Speculative retries: 14896

    Speculative retries: 14901

    Speculative retries: 14905

    Speculative retries: 14946

    Speculative retries: 14948

    Speculative retries: 14957




Suspecting this could be performance dip cause.  Please add in case anyone 
knows more about it.




Regards













On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user 
 wrote:

 we are seeing similar perf issues with counter writes - to reproduce:

cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate 
threads=50 -mode native cql3 user= password= -name  


op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
Total GC count: 750 (4.1) and 744 (4.0)
Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)

On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel 
 wrote:  
 
 Hi All,

Was going through this mail chain 
(https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html) and was 
wondering that if this could cause a performance degradation in 4.1 without 
changing compactionThroughput. 

As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.

RegardsRanju

Re: Query on Performance Dip

2024-03-30 Thread Jon Haddad

Hi,

Unfortunately, the numbers you're posting have no meaning without context.
The speculative retries could be the cause of a problem, or you could
simply be executing enough queries and you have a fairly high variance in
latency which triggers them often.  It's unclear how many queries / second
you're executing and there's no historical information to suggest if what
you're seeing now is an anomaly or business as usual.

If you want to determine if your theory that speculative retries are
causing your performance issue, then you could try changing speculative
retry to a fixed value instead of a percentile, such as 50MS.  It's easy
enough to try and you can get an answer to your question almost immediately.

The problem with this is that you're essentially guessing based on very
limited information - the output of a nodetool command you've run "every
few secs".  I prefer to use a more data driven approach.  Get a CPU flame
graph and figure out where your time is spent:
https://rustyrazorblade.com/post/2023/2023-11-07-async-profiler/

The flame graph will reveal where your time is spent, and you can focus on
improving that, rather than looking at a random statistic that you've
picked.

I just gave a talk at SCALE on distributed systems performance
troubleshooting.  You'll be better off following a methodical process than
guessing at potential root causes, because the odds of you correctly
guessing the root cause in a system this complex is close to zero.  My talk
is here: https://www.youtube.com/watch?v=VX9tHk3VTLE

I'm guessing you don't have dashboards in place if you're relying on
nodetool output with grep.  If your cluster is under 6 nodes, you can take
advantage of AxonOps's free tier: https://axonops.com/

Good dashboards are essential for these types of problems.

Jon

On Sat, Mar 30, 2024 at 2:33 AM ranju goel  wrote:

> Hi All,
>
> On debugging the cluster for performance dip seen while using 4.1.4,  i
> found high speculation retries Value in nodetool tablestats during read
> operation.
>
> I ran the below tablestats command and checked its output after every few
> secs and noticed that retries are on rising side. Also there is one open
> ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
> this.
> /usr/share/cassandra/bin/nodetool -u  -pw  -p 
> tablestats  | grep -i 'Speculative retries'
>
>
>
> Speculative retries: 11633
>
> ..
>
> ..
>
> Speculative retries: 13727
>
>
>
> Speculative retries: 14256
>
> Speculative retries: 14855
>
> Speculative retries: 14858
>
> Speculative retries: 14859
>
> Speculative retries: 14873
>
> Speculative retries: 14875
>
> Speculative retries: 14890
>
> Speculative retries: 14893
>
> Speculative retries: 14896
>
> Speculative retries: 14901
>
> Speculative retries: 14905
>
> Speculative retries: 14946
>
> Speculative retries: 14948
>
> Speculative retries: 14957
>
>
> Suspecting this could be performance dip cause.  Please add in case anyone
> knows more about it.
>
>
> Regards
>
>
>
>
>
>
>
>
> On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
> user@cassandra.apache.org> wrote:
>
>> we are seeing similar perf issues with counter writes - to reproduce:
>>
>> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
>> threads=50 -mode native cql3 user= password= -name 
>>
>>
>> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
>> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
>> Total GC count: 750 (4.1) and 744 (4.0)
>> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>>
>>
>> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
>> goel.ra...@gmail.com> wrote:
>>
>>
>> Hi All,
>>
>> Was going through this mail chain
>> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>>  and was wondering that if this could cause a performance degradation in
>> 4.1 without changing compactionThroughput.
>>
>> As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.
>>
>> Regards
>> Ranju
>>
>

Re: Query on Performance Dip

2024-03-30 Thread ranju goel

Hi All,

On debugging the cluster for performance dip seen while using 4.1.4,  i
found high speculation retries Value in nodetool tablestats during read
operation.

I ran the below tablestats command and checked its output after every few
secs and noticed that retries are on rising side. Also there is one open
ticket (https://issues.apache.org/jira/browse/CASSANDRA-18766) similar to
this.
/usr/share/cassandra/bin/nodetool -u  -pw  -p 
tablestats  | grep -i 'Speculative retries'

Speculative retries: 11633

..

..

Speculative retries: 13727

Speculative retries: 14256

Speculative retries: 14855

Speculative retries: 14858

Speculative retries: 14859

Speculative retries: 14873

Speculative retries: 14875

Speculative retries: 14890

Speculative retries: 14893

Speculative retries: 14896

Speculative retries: 14901

Speculative retries: 14905

Speculative retries: 14946

Speculative retries: 14948

Speculative retries: 14957

Suspecting this could be performance dip cause.  Please add in case anyone
knows more about it.

Regards

On Wed, Mar 27, 2024 at 10:43 PM Subroto Barua via user <
user@cassandra.apache.org> wrote:

> we are seeing similar perf issues with counter writes - to reproduce:
>
> cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate
> threads=50 -mode native cql3 user= password= -name 
>
>
> op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
> latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
> Total GC count: 750 (4.1) and 744 (4.0)
> Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)
>
>
> On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel <
> goel.ra...@gmail.com> wrote:
>
>
> Hi All,
>
> Was going through this mail chain
> (https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html)
>  and was wondering that if this could cause a performance degradation in
> 4.1 without changing compactionThroughput.
>
> As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.
>
> Regards
> Ranju
>

Re: Query on Performance Dip

2024-03-27 Thread Subroto Barua via user

 we are seeing similar perf issues with counter writes - to reproduce:

cassandra-stress counter_write n=10 no-warmup cl=LOCAL_QUORUM -rate 
threads=50 -mode native cql3 user= password= -name  


op rate: 39,260 ops (4.1) and 63,689 ops (4.0)
latency 99th percentile: 7.7ms (4.1) and 1.8ms (4.0)
Total GC count: 750 (4.1) and 744 (4.0)
Avg GC time: 106 ms (4.1) and 110.1 ms (4.0)

On Wednesday, March 27, 2024 at 12:18:50 AM PDT, ranju goel 
 wrote:  
 
 Hi All,

Was going through this mail chain 
(https://www.mail-archive.com/user@cassandra.apache.org/msg63564.html) and was 
wondering that if this could cause a performance degradation in 4.1 without 
changing compactionThroughput. 

As seeing performance dip in Read/Write after upgrading from 4.0 to 4.1.

RegardsRanju

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-27 Thread Caleb Rackliffe

> For your #1 - if there are going to be 100+ million vectors, wouldn't I
want the search to go across nodes?

If you have a replication factor of 3 and 3 nodes, every node will have a
complete copy of the data, so you'd only need to talk to one node. If your
replication factor is 1, you'd have to talk to all three nodes.

On Wed, Mar 27, 2024 at 9:06 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Thank you all for the details on this.
> For your #1 - if there are going to be 100+ million vectors, wouldn't I
> want the search to go across nodes?
>
> Right now, we're running both weaviate (8 node cluster), our main
> cassandra 4 cluster (12 nodes), and a test 3 node cassandra 5 cluster.
> Weaviate does some interesting things like product quantization to reduce
> size and improve search speed.  They get amazing speed, but the drawback
> is, from what I can tell, they load the entire index into RAM.  We've been
> having a reoccurring issue where once it runs out of RAM, it doesn't get
> slow; it just stops working.  Weaviate enables some powerful
> vector+boolean+range queries.  I would love to only have one database!
>
> I'll look into how to do profiling - the terms you use are things I'm not
> familiar with, but I've got chatGPT and google... :)
>
> -Joe
> On 3/21/2024 10:51 PM, Caleb Rackliffe wrote:
>
> To expand on Jonathan’s response, the best way to get SAI to perform on
> the read side is to use it as a tool for large-partition search. In other
> words, if you can model your data such that your queries will be restricted
> to a single partition, two things will happen…
>
> 1.) With all queries (not just ANN queries), you will only hit as many
> nodes as your read consistency level and replication factor require. For
> vector searches, that means you should only hit one node, and it should be
> the coordinating node w/ a properly configured, token-aware client.
>
> 2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as
> your table compaction strategy. This will essentially guarantee your
> (partition-restricted) SAI query hits a small number of SSTable-attached
> indexes. (It’ll hit Memtable-attached indexes as well for any recently
> added data, so if you’re seeing latencies shoot up, it’s possible there
> could be contention on the Memtable-attached index that supports ANN
> queries. I haven’t done a deep dive on it. You can always flush Memtables
> directly before queries to factor that out.)
>
> If you can do all of the above, the simple performance of the local index
> query and its post-filtering reads is probably the place to explore
> further. If you manage to collect any profiling data (JFR, flamegraphs via
> async-profiler, etc) I’d be happy to dig into it with you.
>
> Thanks for kicking the tires!
>
> On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user
>   wrote:
>
> 
>
> Hi Joe,
>
>
>
> Have you considered submitting something for Community Over Code NA 2024?
> The CFP is still open for a few more weeks, options could be my Performance
> Engineering track or the Cassandra track – or both 
>
>
>
>
> https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D
>
>
>
> Regards, Paul Brebner
>
>
>
>
>
>
>
> *From: *Joe Obernberger 
> 
> *Date: *Friday, 22 March 2024 at 3:19 am
> *To: *user@cassandra.apache.org 
> 
> *Subject: *Cassandra 5.0 Beta1 - vector searching results
>
> EXTERNAL EMAIL - USE CAUTION when clicking links or attachments
>
>
>
>
> Hi All - I'd like to share some initial results for the vector search on
> Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
> storage.
>
> Have a table (doc.embeddings_googleflan5tlarge) with definition:
>
> CREATE TABLE doc.embeddings_googleflant5large (
>  uuid text,
>  type text,
>  fieldname text,
>  offset int,
>  sourceurl text,
>  textdata text,
>  creationdate timestamp,
>  embeddings vector,
>  metadata boolean,
>  source text,
>  PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
> ) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
> textdata ASC)
>  AND additional_write_policy = '99p'
>  AND allow_auto_snapshot = true
>  AND bloom_filter_fp_chance = 0.01
>  AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>  AND cdc = false
>  AND comment = ''
>  AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>  AND compression = {'chunk_length_in_kb': '16', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>  AND memtable = 'default'
>  AND crc_check_chance = 1.0
>  AND default_time_to_live = 0
>  AND extensions = {}
>  AND gc_grace_seconds = 864000
>  AND incremental_backups = true
>  AND max_index_interval = 2048
>  AND memtable_flush_period_in_ms = 0

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-27 Thread Joe Obernberger


Thank you all for the details on this.
For your #1 - if there are going to be 100+ million vectors, wouldn't I 
want the search to go across nodes?


Right now, we're running both weaviate (8 node cluster), our main 
cassandra 4 cluster (12 nodes), and a test 3 node cassandra 5 cluster.  
Weaviate does some interesting things like product quantization to 
reduce size and improve search speed.  They get amazing speed, but the 
drawback is, from what I can tell, they load the entire index into RAM.  
We've been having a reoccurring issue where once it runs out of RAM, it 
doesn't get slow; it just stops working.  Weaviate enables some powerful 
vector+boolean+range queries.  I would love to only have one database!


I'll look into how to do profiling - the terms you use are things I'm 
not familiar with, but I've got chatGPT and google... :)


-Joe

On 3/21/2024 10:51 PM, Caleb Rackliffe wrote:
To expand on Jonathan’s response, the best way to get SAI to perform 
on the read side is to use it as a tool for large-partition search. In 
other words, if you can model your data such that your queries will be 
restricted to a single partition, two things will happen…


1.) With all queries (not just ANN queries), you will only hit as many 
nodes as your read consistency level and replication factor require. 
For vector searches, that means you should only hit one node, and it 
should be the coordinating node w/ a properly configured, token-aware 
client.


2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS 
as your table compaction strategy. This will essentially guarantee 
your (partition-restricted) SAI query hits a small number of 
SSTable-attached indexes. (It’ll hit Memtable-attached indexes as well 
for any recently added data, so if you’re seeing latencies shoot up, 
it’s possible there could be contention on the Memtable-attached index 
that supports ANN queries. I haven’t done a deep dive on it. You can 
always flush Memtables directly before queries to factor that out.)


If you can do all of the above, the simple performance of the local 
index query and its post-filtering reads is probably the place to 
explore further. If you manage to collect any profiling data (JFR, 
flamegraphs via async-profiler, etc) I’d be happy to dig into it with you.


Thanks for kicking the tires!

On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user 
 wrote:




Hi Joe,

Have you considered submitting something for Community Over Code NA 
2024? The CFP is still open for a few more weeks, options could be my 
Performance Engineering track or the Cassandra track – or both 


https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner

*From: *Joe Obernberger 
*Date: *Friday, 22 March 2024 at 3:19 am
*To: *user@cassandra.apache.org 
*Subject: *Cassandra 5.0 Beta1 - vector searching results

EXTERNAL EMAIL - USE CAUTION when clicking links or attachments




Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflan5tlarge) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local 17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-25 Thread Brebner, Paul via user

Hi all, curious if there is support for the new Cassandra vector data type in 
any open-source Kafka Connect Cassandra Sink connectors please? i.e. To write 
vector data to Cassandra from Kafka. Regards, Paul

From: Caleb Rackliffe 
Date: Friday, 22 March 2024 at 1:52 pm
To: user@cassandra.apache.org 
Subject: Re: Cassandra 5.0 Beta1 - vector searching results
You don't often get email from calebrackli...@gmail.com. Learn why this is 
important<https://aka.ms/LearnAboutSenderIdentification>

EXTERNAL EMAIL - USE CAUTION when clicking links or attachments

To expand on Jonathan’s response, the best way to get SAI to perform on the 
read side is to use it as a tool for large-partition search. In other words, if 
you can model your data such that your queries will be restricted to a single 
partition, two things will happen…

1.) With all queries (not just ANN queries), you will only hit as many nodes as 
your read consistency level and replication factor require. For vector 
searches, that means you should only hit one node, and it should be the 
coordinating node w/ a properly configured, token-aware client.

2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your 
table compaction strategy. This will essentially guarantee your 
(partition-restricted) SAI query hits a small number of SSTable-attached 
indexes. (It’ll hit Memtable-attached indexes as well for any recently added 
data, so if you’re seeing latencies shoot up, it’s possible there could be 
contention on the Memtable-attached index that supports ANN queries. I haven’t 
done a deep dive on it. You can always flush Memtables directly before queries 
to factor that out.)

If you can do all of the above, the simple performance of the local index query 
and its post-filtering reads is probably the place to explore further. If you 
manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc) 
I’d be happy to dig into it with you.

Thanks for kicking the tires!

On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user  
wrote:

Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The 
CFP is still open for a few more weeks, options could be my Performance 
Engineering track or the Cassandra track – or both 

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner

From: Joe Obernberger 
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org 
Subject: Cassandra 5.0 Beta1 - vector searching results
EXTERNAL EMAIL - USE CAUTION when clicking links or attachments

Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflan5tlarge) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
 Read Count: 0
 Read Latency: NaN ms
 Write Count: 2893108
 Write L

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Caleb Rackliffe

To expand on Jonathan’s response, the best way to get SAI to perform on the read side is to use it as a tool for large-partition search. In other words, if you can model your data such that your queries will be restricted to a single partition, two things will happen…1.) With all queries (not just ANN queries), you will only hit as many nodes as your read consistency level and replication factor require. For vector searches, that means you should only hit one node, and it should be the coordinating node w/ a properly configured, token-aware client.2.) You can use LCS (or UCS configured to mimic LCS) instead of STCS as your table compaction strategy. This will essentially guarantee your (partition-restricted) SAI query hits a small number of SSTable-attached indexes. (It’ll hit Memtable-attached indexes as well for any recently added data, so if you’re seeing latencies shoot up, it’s possible there could be contention on the Memtable-attached index that supports ANN queries. I haven’t done a deep dive on it. You can always flush Memtables directly before queries to factor that out.)If you can do all of the above, the simple performance of the local index query and its post-filtering reads is probably the place to explore further. If you manage to collect any profiling data (JFR, flamegraphs via async-profiler, etc) I’d be happy to dig into it with you.Thanks for kicking the tires!On Mar 21, 2024, at 8:20 PM, Brebner, Paul via user  wrote:







Hi Joe,
 
Have you considered submitting something for Community Over Code NA 2024? The CFP is still open for a few more weeks, options could be my Performance Engineering track or the Cassandra
 track – or both 

 
https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D
 
Regards, Paul Brebner
 
 
 



From:
Joe Obernberger 
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org 
Subject: Cassandra 5.0 Beta1 - vector searching results


EXTERNAL EMAIL - USE CAUTION when clicking links or attachments




Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflan5tlarge) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
 Read Count: 0
 Read Latency: NaN ms
 Write Count: 2893108
 Write Latency: 326.3586520174843 ms
 Pending Flushes: 0
 Table: embeddings_googleflant5large
 SSTable count: 6
 Old SSTable count: 0
 Max SSTable size: 5.108GiB
 Space used (live): 19318114423
 Space used (total): 19318114423
 Space used by snapshots (total): 0
 Off heap memory used (total): 4874912
 SSTable Compression Ratio: 0.97448
 Number of partitions (estimate): 58399
 Memtable cell count: 0

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Brebner, Paul via user

Hi Joe,

Have you considered submitting something for Community Over Code NA 2024? The 
CFP is still open for a few more weeks, options could be my Performance 
Engineering track or the Cassandra track – or both 

https://www.linkedin.com/pulse/cfp-community-over-code-na-denver-2024-performance-track-paul-brebner-nagmc/?trackingId=PlmmMjMeQby0Mozq8cnIpA%3D%3D

Regards, Paul Brebner



From: Joe Obernberger 
Date: Friday, 22 March 2024 at 3:19 am
To: user@cassandra.apache.org 
Subject: Cassandra 5.0 Beta1 - vector searching results
EXTERNAL EMAIL - USE CAUTION when clicking links or attachments




Hi All - I'd like to share some initial results for the vector search on
Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
storage.

Have a table (doc.embeddings_googleflan5tlarge) with definition:

CREATE TABLE doc.embeddings_googleflant5large (
 uuid text,
 type text,
 fieldname text,
 offset int,
 sourceurl text,
 textdata text,
 creationdate timestamp,
 embeddings vector,
 metadata boolean,
 source text,
 PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
textdata ASC)
 AND additional_write_policy = '99p'
 AND allow_auto_snapshot = true
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND cdc = false
 AND comment = ''
 AND compaction = {'class':
'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '16', 'class':
'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND memtable = 'default'
 AND crc_check_chance = 1.0
 AND default_time_to_live = 0
 AND extensions = {}
 AND gc_grace_seconds = 864000
 AND incremental_backups = true
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair = 'BLOCKING'
 AND speculative_retry = '99p';

CREATE CUSTOM INDEX ann_index_googleflant5large ON
doc.embeddings_googleflant5large (embeddings) USING 'sai';
CREATE CUSTOM INDEX offset_index_googleflant5large ON
doc.embeddings_googleflant5large (offset) USING 'sai';

nodetool status -r

UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1

nodetool tablestats doc.embeddings_googleflant5large

Total number of tables: 1

Keyspace: doc
 Read Count: 0
 Read Latency: NaN ms
 Write Count: 2893108
 Write Latency: 326.3586520174843 ms
 Pending Flushes: 0
 Table: embeddings_googleflant5large
 SSTable count: 6
 Old SSTable count: 0
 Max SSTable size: 5.108GiB
 Space used (live): 19318114423
 Space used (total): 19318114423
 Space used by snapshots (total): 0
 Off heap memory used (total): 4874912
 SSTable Compression Ratio: 0.97448
 Number of partitions (estimate): 58399
 Memtable cell count: 0
 Memtable data size: 0
 Memtable off heap memory used: 0
 Memtable switch count: 16
 Speculative retries: 0
 Local read count: 0
 Local read latency: NaN ms
 Local write count: 2893108
 Local write latency: NaN ms
 Local read/write ratio: 0.0
 Pending flushes: 0
 Percent repaired: 100.0
 Bytes repaired: 9.066GiB
 Bytes unrepaired: 0B
 Bytes pending repair: 0B
 Bloom filter false positives: 7245
 Bloom filter false ratio: 0.00286
 Bloom filter space used: 87264
 Bloom filter off heap memory used: 87216
 Index summary off heap memory used: 34624
 Compression metadata off heap memory used: 4753072
 Compacted partition minimum bytes: 2760
 Compacted partition maximum bytes: 4866323
 Compacted partition mean bytes: 154523
 Average live cells per slice (last five minutes): NaN
 Maximum live cells per slice (last five minutes): 0
 Average tombstones per slice (last five minutes): NaN
 Maximum tombstones per slice (last five minutes): 0
 Droppable tombstone ratio: 0.0

nodetool tablehistograms doc.embeddings_googleflant5large

Re: Cassandra 5.0 Beta1 - vector searching results

2024-03-21 Thread Jonathan Ellis

Hi Joe,

Thanks for testing out vector search!

Cassandra 5.0 is about six months behind on vector search progress.  Part
of this is keeping up with JVector releases but more of it is core
improvements to SAI.  Unfortunately there's no easy fix for the impedance
mismatch between a field where the state of the art is improving almost
daily, and a project with a release cycle measured in years.

DataStax's cutting-edge vector search work is public and open source [1]
but it's going to be a while before we have bandwidth to upstream it to
Apache, and longer before it can be released in 5.1 or 6.0.  If you're
interested in collaborating on this, I'm happy to get you pointed in the
right direction.

In the meantime, I can also recommend trying out DataStax's Astra [2]
service, where we deploy improvements regularly.  My guesstimate is that
Astra will be 2x faster at vanilla ANN queries (with no WHERE clause) and
10x-100x faster at queries with additional predicates, depending on the
cardinality.  (As an example of what needs to be upstreamed, we added a
primitive cost-based analyzer back in January to fix the kind of timeouts
you're seeing with offset=1, and we just committed a more sophisticated one
this week [3].)

If you're stuck with 5.0, my best advice is to compact as aggressively as
possible, since SAI queries are O(N) in the number of sstables.

[1] https://github.com/datastax/cassandra/tree/vsearch
[2] https://www.datastax.com/products/datastax-astra
[3]
https://github.com/datastax/cassandra/commit/eeb33dd62b9b74ecf818a263fd73dbe6714b0df0

On Thu, Mar 21, 2024 at 9:19 AM Joe Obernberger <
joseph.obernber...@gmail.com> wrote:

> Hi All - I'd like to share some initial results for the vector search on
> Cassandra 5.0 beta1.  3 node cluster running in kubernetes; fast Netapp
> storage.
>
> Have a table (doc.embeddings_googleflan5tlarge) with definition:
>
> CREATE TABLE doc.embeddings_googleflant5large (
>  uuid text,
>  type text,
>  fieldname text,
>  offset int,
>  sourceurl text,
>  textdata text,
>  creationdate timestamp,
>  embeddings vector,
>  metadata boolean,
>  source text,
>  PRIMARY KEY ((uuid, type), fieldname, offset, sourceurl, textdata)
> ) WITH CLUSTERING ORDER BY (fieldname ASC, offset ASC, sourceurl ASC,
> textdata ASC)
>  AND additional_write_policy = '99p'
>  AND allow_auto_snapshot = true
>  AND bloom_filter_fp_chance = 0.01
>  AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>  AND cdc = false
>  AND comment = ''
>  AND compaction = {'class':
> 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy',
> 'max_threshold': '32', 'min_threshold': '4'}
>  AND compression = {'chunk_length_in_kb': '16', 'class':
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>  AND memtable = 'default'
>  AND crc_check_chance = 1.0
>  AND default_time_to_live = 0
>  AND extensions = {}
>  AND gc_grace_seconds = 864000
>  AND incremental_backups = true
>  AND max_index_interval = 2048
>  AND memtable_flush_period_in_ms = 0
>  AND min_index_interval = 128
>  AND read_repair = 'BLOCKING'
>  AND speculative_retry = '99p';
>
> CREATE CUSTOM INDEX ann_index_googleflant5large ON
> doc.embeddings_googleflant5large (embeddings) USING 'sai';
> CREATE CUSTOM INDEX offset_index_googleflant5large ON
> doc.embeddings_googleflant5large (offset) USING 'sai';
>
> nodetool status -r
>
> UN  cassandra-1.cassandra5.cassandra5-jos.svc.cluster.local 18.02 GiB
> 128 100.0% f2989dea-908b-4c06-9caa-4aacad8ba0e8  rack1
> UN  cassandra-2.cassandra5.cassandra5-jos.svc.cluster.local  17.98 GiB
> 128 100.0% ec4e506d-5f0d-475a-a3c1-aafe58399412  rack1
> UN  cassandra-0.cassandra5.cassandra5-jos.svc.cluster.local  18.16 GiB
> 128 100.0% 92c6d909-ee01-4124-ae03-3b9e2d5e74c0  rack1
>
> nodetool tablestats doc.embeddings_googleflant5large
>
> Total number of tables: 1
> 
> Keyspace: doc
>  Read Count: 0
>  Read Latency: NaN ms
>  Write Count: 2893108
>  Write Latency: 326.3586520174843 ms
>  Pending Flushes: 0
>  Table: embeddings_googleflant5large
>  SSTable count: 6
>  Old SSTable count: 0
>  Max SSTable size: 5.108GiB
>  Space used (live): 19318114423
>  Space used (total): 19318114423
>  Space used by snapshots (total): 0
>  Off heap memory used (total): 4874912
>  SSTable Compression Ratio: 0.97448
>  Number of partitions (estimate): 58399
>  Memtable cell count: 0
>  Memtable data size: 0
>  Memtable off heap memory used: 0
>  Memtable switch count: 16
>  Speculative retries: 0
>  Local read count: 0
>  Local read latency: NaN ms
>  Local

Re: Alternate apt repo for Debian installation?

2024-03-20 Thread Grant Talarico

Oh, nevermind. It looks like debian.cassandra.apache.org has come back
online and I can get once again pull from the apt repo.

On Wed, Mar 20, 2024 at 2:15 PM Grant Talarico  wrote:

> I already tried those. My particular application requires a minimum
> version of 3.11.14 and I have 3.11.16 installed in my staging environment.
> The archive.apache.org only has it's latest of 3.11.13.
>
> On Wed, Mar 20, 2024 at 1:55 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> You can try https://archive.apache.org/dist/cassandra/debian/
>>
>> The deb files can be found here:
>> https://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/
>> On 20/03/2024 20:47, Grant Talarico wrote:
>>
>> Hi there. Hopefully this is the right place to ask this question. I'm
>> trying to install the latest version of Cassandra 3.11 using debian
>> packages through the debian.cassandra.apache.org apt repo but it appears
>> to be down at the moment. Is there an alternate apt repo I might be able to
>> use as a backup?
>>
>> - Grant
>>
>>
>
> --
>
> *Grant Talarico IT Senior Systems Engineer*
>
>
> 901 Marshall St, Suite 200
> Redwood City, CA 94063
> http://www.imvu.com
>


-- 

*Grant Talarico IT Senior Systems Engineer*


901 Marshall St, Suite 200
Redwood City, CA 94063
http://www.imvu.com

Re: Alternate apt repo for Debian installation?

2024-03-20 Thread Grant Talarico

I already tried those. My particular application requires a minimum version
of 3.11.14 and I have 3.11.16 installed in my staging environment. The
archive.apache.org only has it's latest of 3.11.13.

On Wed, Mar 20, 2024 at 1:55 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> You can try https://archive.apache.org/dist/cassandra/debian/
>
> The deb files can be found here:
> https://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/
> On 20/03/2024 20:47, Grant Talarico wrote:
>
> Hi there. Hopefully this is the right place to ask this question. I'm
> trying to install the latest version of Cassandra 3.11 using debian
> packages through the debian.cassandra.apache.org apt repo but it appears
> to be down at the moment. Is there an alternate apt repo I might be able to
> use as a backup?
>
> - Grant
>
>

-- 

*Grant Talarico IT Senior Systems Engineer*


901 Marshall St, Suite 200
Redwood City, CA 94063
http://www.imvu.com

Re: Alternate apt repo for Debian installation?

2024-03-20 Thread Bowen Song via user


You can try https://archive.apache.org/dist/cassandra/debian/

The deb files can be found here: 
https://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/


On 20/03/2024 20:47, Grant Talarico wrote:
Hi there. Hopefully this is the right place to ask this question. I'm 
trying to install the latest version of Cassandra 3.11 using debian 
packages through the debian.cassandra.apache.org 
 apt repo but it appears to be 
down at the moment. Is there an alternate apt repo I might be able to 
use as a backup?


- Grant

Re: [EXTERNAL] Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Bowen Song via user


Short answer:

There's no definite answer to that question.


Longer answer:

I doubt such date has already been decided. It's largely driven by the 
time required to fix known issues and any potential new issues 
discovered during the BETA and RC process. If you want to track the 
progress, feel free to look at the project's Jira boards, there's a 5.0 
GA board dedicated for that.


Furthermore, it's likely there will only be an experimental support for 
Java 17 in Cassandra 5.0, which means it shouldn't be used on production 
environments.


So, would you like to keep waiting indefinitely for the Java 17 official 
support, or run Cassandra 4.1 on Java 11 today and upgrade when newer 
version becomes available?



On 18/03/2024 13:10, Divyanshi Kaushik via user wrote:

Thanks for your reply.

As Cassandra has moved to Java 17 in it's *5.0-BETA1* (Latest release 
on 2023-12-05). Can you please let us know when the team is planning 
to GA Cassandra 5.0 version which has Java 17 support?


Regards,
Divyanshi

*From:* Bowen Song via user 
*Sent:* Monday, March 18, 2024 5:14 PM
*To:* user@cassandra.apache.org 
*Cc:* Bowen Song 
*Subject:* [EXTERNAL] Re: About Cassandra stable version having Java 
17 support


*CAUTION:* This email originated from outside the organization. Do not 
click links or open attachments unless you recognize the sender and 
know the content is safe.


Why Java 17? It makes no sense to choose an officially non-supported 
library version for a piece of software. That decision making process 
is the problem, not the software's library version compatibility.



On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:

Hi All,

As per my project requirement, Java 17 needs to be used. Can you
please let us know when you are planning to release the next
stable version of Cassandra having Java 17 support?

Regards,
Divyanshi
This email and any files transmitted with it are confidential,
proprietary and intended solely for the individual or entity to
whom they are addressed. If you have received this email in error
please delete it immediately.

This email and any files transmitted with it are confidential, 
proprietary and intended solely for the individual or entity to whom 
they are addressed. If you have received this email in error please 
delete it immediately.

Re: [EXTERNAL] Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Divyanshi Kaushik via user

Thanks for your reply.

As Cassandra has moved to Java 17 in it's 5.0-BETA1 (Latest release on 
2023-12-05). Can you please let us know when the team is planning to GA 
Cassandra 5.0 version which has Java 17 support?

Regards,
Divyanshi

From: Bowen Song via user 
Sent: Monday, March 18, 2024 5:14 PM
To: user@cassandra.apache.org 
Cc: Bowen Song 
Subject: [EXTERNAL] Re: About Cassandra stable version having Java 17 support


CAUTION: This email originated from outside the organization. Do not click 
links or open attachments unless you recognize the sender and know the content 
is safe.

Why Java 17? It makes no sense to choose an officially non-supported library 
version for a piece of software. That decision making process is the problem, 
not the software's library version compatibility.


On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:
Hi All,

As per my project requirement, Java 17 needs to be used. Can you please let us 
know when you are planning to release the next stable version of Cassandra 
having Java 17 support?

Regards,
Divyanshi
This email and any files transmitted with it are confidential, proprietary and 
intended solely for the individual or entity to whom they are addressed. If you 
have received this email in error please delete it immediately.
This email and any files transmitted with it are confidential, proprietary and 
intended solely for the individual or entity to whom they are addressed. If you 
have received this email in error please delete it immediately.

Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Bowen Song via user

Why Java 17? It makes no sense to choose an officially non-supported 
library version for a piece of software. That decision making process is 
the problem, not the software's library version compatibility.



On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:

Hi All,

As per my project requirement, Java 17 needs to be used. Can you 
please let us know when you are planning to release the next stable 
version of Cassandra having Java 17 support?


Regards,
Divyanshi
This email and any files transmitted with it are confidential, 
proprietary and intended solely for the individual or entity to whom 
they are addressed. If you have received this email in error please 
delete it immediately.

Re: Documentation about TTL and tombstones

2024-03-18 Thread Sebastian Marsching


> It's actually correct to do it how it is today.
> Insertion date does not matter, what matters is the time after tombstones are 
> supposed to be deleted.
> If the delete got to all nodes, sure, no problem, but if any of the nodes 
> didn't get the delete, and you would get rid of the tombstones before running 
> a repair, you might have nodes that still has that data.
> Then following a repair, that data will be copied to other replicas, and that 
> data you thought you deleted, will be brought back to life.

Sure, for regular data that does not have a TTL, this makes sense. But I claim 
that data with a TTL is deleted when it is inserted. It’s just that this delete 
only becomes effective at some future date.

In order to understand whether data might reappear, we have to consider four 
cases. Let us first consider the three cases where the INSERT / UPDATE did not 
overwrite any existing data that would have lived longer than the new data:

1. Let us assume that the data is successfully written to all nodes and no 
repair is run. After the TTL expires, the data turns into a tombstone, but 
because the data was present on all nodes, the tombstone is present on all 
nodes, so there is no risk of data reappearing.

2. Let us assume that this data is not written to all nodes but a repair is run 
within the TTL. After that, we effectively have the first situation, so there 
is no risk of data reappearing.

3. Let us assume that this data is not written to all nodes and no repair is 
run within the TTL. After the TTL has passed, the data expires on the nodes 
where it has been written. Now, we have tombstones on these nodes. If we get 
rid of the tombstones, there is no risk of the data reappearing, because there 
are no nodes that have the data, so even if we run a repair in the future, 
there is no risk that the data magically reappears.

Now, let us consider the cases where data that either had no TTL or had a TTL 
that expired after the TTL of the newly inserted data was overwritten. Again, 
there are three possible scenarios:

4. Let us assume that the data is successfully written to all nodes and no 
repair is run. After the TTL expires, the data turns into a tombstone, but 
because the data was present on all nodes, the tombstone is present on all 
nodes, so there is no risk of data reappearing.

5. Let us assume that this data is not written to all nodes but a repair is run 
within the TTL. After that, we effectively have the first situation, so there 
is no risk of data reappearing.

6. Let us assume that this data is not written to all nodes and no repair is 
run within the TTL. After the TTL has passed, the data expires on the nodes 
where it has been written. Now, we have tombstones on these nodes. If we get 
rid of the tombstones, there is the risk of the data reappearing, because the 
older data that was overwritten by the INSERT / UPDATE might still exist on 
some nodes, and as the data with the TTL never made it to these nodes, there is 
no tombstone on these nodes and thus the older data can reappear.

So, we only have to worry about the last scenario. In this scenario, we have to 
ensure that either the inserted data with the TTL is repaired (which brings us 
back to scenario 5), or that the tombstones are repaired before they are 
discarded.

This is why I claim that for data with a TTL, gc_grace_seconds should 
effectively start when the data is inserted, not when it is converted into a 
tombstone: It does not matter whether the data with the TTL is repaired or the 
tombstone is repaired. As long as either of these things between the data with 
the TTL being inserted and the tombstone being reclaimed, there is no risk of 
deleted or overwritten data reappearing.



smime.p7s
Description: S/MIME cryptographic signature

Re: Documentation about TTL and tombstones

2024-03-17 Thread Gil Ganz

It's actually correct to do it how it is today.
Insertion date does not matter, what matters is the time after tombstones
are supposed to be deleted.
If the delete got to all nodes, sure, no problem, but if any of the nodes
didn't get the delete, and you would get rid of the tombstones before
running a repair, you might have nodes that still has that data.
Then following a repair, that data will be copied to other replicas, and
that data you thought you deleted, will be brought back to life.

On Sat, Mar 16, 2024 at 5:39 PM Sebastian Marsching 
wrote:

> > That's not how gc_grace_seconds work.
> > gc_grace_seconds controls how much time *after* a tombstone can be
> deleted, it can actually be deleted, in order to give you enough time to
> run repairs.
> >
> > Say you have data that is about to expire on March 16 8am, and
> gc_grace_seconds is 10 days.
> > After Mar 16 8am that data will be a tombstone, and only after March 26
> 8am, a compaction  *might* remove it, if all other conditions are met.
>
> You are right. I do not understand why it is implemented this way, but you
> are 100 % correct that it works this way.
>
> I thought that gc_grace_seconds is all about being able to repair the
> table before tombstones are removed, so that deleted data cannot repappear.
> But when the data has a TTL, it should not matter whether the original data
> ore the tombstone is synchronized as part of the repair process. After all,
> the original data should turn into a tombstone, so if it was present on all
> nodes, there is no risk of deleted data reappearing. Therefore, I think it
> would make more sense to start gc_grace_seconds when the data is inserted /
> updated. I don’t know why it was not implemented this way.
>
>

Re: Documentation about TTL and tombstones

2024-03-16 Thread Sebastian Marsching


> That's not how gc_grace_seconds work.
> gc_grace_seconds controls how much time *after* a tombstone can be deleted, 
> it can actually be deleted, in order to give you enough time to run repairs.
>
> Say you have data that is about to expire on March 16 8am, and 
> gc_grace_seconds is 10 days.
> After Mar 16 8am that data will be a tombstone, and only after March 26 8am, 
> a compaction  *might* remove it, if all other conditions are met.

You are right. I do not understand why it is implemented this way, but you are 
100 % correct that it works this way.

I thought that gc_grace_seconds is all about being able to repair the table 
before tombstones are removed, so that deleted data cannot repappear. But when 
the data has a TTL, it should not matter whether the original data ore the 
tombstone is synchronized as part of the repair process. After all, the 
original data should turn into a tombstone, so if it was present on all nodes, 
there is no risk of deleted data reappearing. Therefore, I think it would make 
more sense to start gc_grace_seconds when the data is inserted / updated. I 
don’t know why it was not implemented this way.



smime.p7s
Description: S/MIME cryptographic signature

Re: Documentation about TTL and tombstones

2024-03-16 Thread Gil Ganz

That's not how gc_grace_seconds work.
gc_grace_seconds controls how much time *after* a tombstone can be deleted,
it can actually be deleted, in order to give you enough time to run repairs.

Say you have data that is about to expire on March 16 8am, and
gc_grace_seconds is 10 days.
After Mar 16 8am that data will be a tombstone, and only after March 26
8am, a compaction  *might* remove it, if all other conditions are met.
gil


On Fri, Mar 15, 2024 at 12:58 AM Sebastian Marsching <
sebast...@marsching.com> wrote:

>
> by reading the documentation about TTL
>
> https://cassandra.apache.org/doc/4.1/cassandra/operating/compaction/index.html#ttl
> It mention that it creates a tombstone when data expired, how does it
> possible without writing to the tombstone on the table ? I thought TTL
> doesn't create tombstones since the ttl is present together with the write
> time timestmap
> at the row level
>
>
> If you read carefully, you will notice that no tombstone is created and
> instead the data is *converted* into a tombstone. So, after the TTL has
> expired, the inserted data effectively acts as a tombstone. This is needed,
> because the now expired data might hide older data that has not expired
> yet. If the newer data was simply dropped after the TTL expired, older data
> might reappear.
>
> If I understand it correctly, you can avoid data with a TTL being
> converted into a tombstone by choosing a TTL that is greater than
> gc_grace_seconds. Technically, the data is still going to be converted into
> a tombstone when the TTL expires, but this tombstone will immediately be
> eligible for garbage collection.
>
>

Re: Documentation about TTL and tombstones

2024-03-14 Thread Sebastian Marsching


> by reading the documentation about TTL
> https://cassandra.apache.org/doc/4.1/cassandra/operating/compaction/index.html#ttl
> It mention that it creates a tombstone when data expired, how does it  
> possible without writing to the tombstone on the table ? I thought TTL 
> doesn't create tombstones since the ttl is present together with the write 
> time timestmap
> at the row level

If you read carefully, you will notice that no tombstone is created and instead 
the data is *converted* into a tombstone. So, after the TTL has expired, the 
inserted data effectively acts as a tombstone. This is needed, because the now 
expired data might hide older data that has not expired yet. If the newer data 
was simply dropped after the TTL expired, older data might reappear.

If I understand it correctly, you can avoid data with a TTL being converted 
into a tombstone by choosing a TTL that is greater than gc_grace_seconds. 
Technically, the data is still going to be converted into a tombstone when the 
TTL expires, but this tombstone will immediately be eligible for garbage 
collection.



smime.p7s
Description: S/MIME cryptographic signature

RE: SStables stored in directory with different table ID than the one found in system_schema.tables

2024-03-13 Thread Michalis Kotsiouros (EXT) via user

Hello everyone,

The recovery was performed successfully some days ago. Finally, the problematic 
datacenter was removed and added back to the cluster.

 

BR

MK

 

From: Michalis Kotsiouros (EXT) via user  
Sent: February 12, 2024 17:59
To: Sebastian Marsching ; user@cassandra.apache.org
Cc: Michalis Kotsiouros (EXT) 
Subject: RE: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

Hello Sebastian and community,

Thanks a lot for the post. It is really helpful.

After some additional observations, I am more concerned about trying to 
rename/move the sstables directory. I have observed that my client processes 
complain about missing columns even though those columns appear on the describe 
schema output.

My plan is to first try a restart of the Cassandra nodes and if that does not 
help to re-build the datacenter – remove it and then add it back to the cluster.

 

BR

MK

 

From: Sebastian Marsching mailto:sebast...@marsching.com> > 
Sent: February 10, 2024 01:00
To: Bowen Song via user mailto:user@cassandra.apache.org> >
Cc: Michalis Kotsiouros (EXT) mailto:michalis.kotsiouros@ericsson.com> >
Subject: Re: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

You might the following discussion from the mailing-list archive helpful:

 

https://lists.apache.org/thread/6hnypp6vfxj1yc35ptp0xf15f11cx77d

 

This thread discusses a similar situation gives a few pointers on when it might 
be save to simply move the SSTables around.

 

Am 08.02.2024 um 13:06 schrieb Michalis Kotsiouros (EXT) via user 
mailto:user@cassandra.apache.org> >:

 

Hello everyone,

I have found this post on-line and seems to be recent.

 
<https://stackoverflow.com/questions/77837100/mismatch-between-cassandra-table-uuid-in-linux-file-directory-and-system-schema>
 Mismatch between Cassandra table uuid in linux file directory and 
system_schema.tables - Stack Overflow

The description seems to be the same as my problem as well.

In this post, the proposal is to copy the sstables to the dir with the ID found 
in system_schema.tables. I think it is equivalent with my assumption to rename 
the directories….

Have anyone seen this before? Do you consider those approaches safe?

 

BR

MK

 

From: Michalis Kotsiouros (EXT) 
Sent: February 08, 2024 11:33
To: user@cassandra.apache.org <mailto:user@cassandra.apache.org> 
Subject: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

Hello community,

I have a Cassandra server on 3.11.13 on SLESS 12.5.

I have noticed in the logs the following line:

Datacenter A

org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId d8c1bea0-82ed-11ee-8ac8-1513e17b60b1. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.

Datacenter B

org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId 0fedabd0-11f7-11ea-9450-e3ff59b2496b. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.

 

This error results in failure of all streaming tasks.

I have checked the sstables directories and I see that:

 

In Datacenter A the sstables directory is:

-0fedabd0-11f7-11ea-9450-e3ff59b2496b

 

In Datacenter B the sstables directory are:

-0fedabd011f711ea9450e3ff59b2496b

- d8c1bea082ed11ee8ac81513e17b60b1

In this datacenter although the - d8c1bea082ed11ee8ac81513e17b60b1 
dir is more recent it is empty and all sstables are stored under 
-0fedabd011f711ea9450e3ff59b2496b

 

I have also checked the system_schema.tables in all Cassandra nodes and I see 
that for the specific table the ID is consistent across all nodes and it is:

d8c1bea0-82ed-11ee-8ac8-1513e17b60b1

 

So it seems that the schema is a bit mess in all my datacenters. I am not 
really interested to understand how it ended up in this status but more on how 
to recover.

Both datacenters seem to have this inconsistency between the id stored 
system_schema.tables and the one used in the sstables directory.

Do you have any proposal on how to recover?

I have thought of renaming the dir from 
-0fedabd011f711ea9450e3ff59b2496b to - 
d8c1bea082ed11ee8ac81513e17b60b1 but it does not look safe and I would not want 
to risk my data since this is a production system.

 

Thank you in advance.

 

BR

Michail Kotsiouros

 



smime.p7s
Description: S/MIME cryptographic signature

Re: Question about commit consistency level for Lightweight Transactions in Paxos v2

2024-03-11 Thread Weng, Justin via user

So for upgrading Paxos to v2, the non-serial consistency level should be set to 
ANY or LOCAL_QUORUM, and the serial consistency level should still be SERIAL or 
LOCAL_SERIAL. Got it, thanks!

From: Laxmikant Upadhyay 
Date: Tuesday, 12 March 2024 at 7:33 am
To: user@cassandra.apache.org 
Cc: Weng, Justin 
Subject: Re: Question about commit consistency level for Lightweight 
Transactions in Paxos v2
You don't often get email from laxmikant@gmail.com. Learn why this is 
important<https://aka.ms/LearnAboutSenderIdentification>

EXTERNAL EMAIL - USE CAUTION when clicking links or attachments

You need to set both in case of lwt. your regular non -serial consistency level 
will only applied during commit phase of lwt.

On Wed, 6 Mar, 2024, 03:30 Weng, Justin via user, 
mailto:user@cassandra.apache.org>> wrote:
Hi Cassandra Community,

I’ve been investigating Cassandra Paxos v2 (as implemented in 
CEP-14<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-14%3A+Paxos+Improvements>)
 which improves the performance of lightweight transaction (LWT). But I’ve got 
a question about setting the commit consistency level for LWT after upgrading 
Paxos.

In 
cqlsh<https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshSerialConsistency.html>,
 gocql<https://github.com/gocql/gocql/blob/master/session.go#L1247> and Python 
driver<https://docs.datastax.com/en/developer/python-driver/3.29/api/cassandra/query/#cassandra.query.Statement.serial_consistency_level>,
 there are two settings for consistency levels: normal Consistency Level and 
Serial Consistency Level. As mentioned in the cqlsh 
documentation<https://docs.datastax.com/en/cql-oss/3.3/cql/cql_reference/cqlshSerialConsistency.html>,
 Serial Consistency Level is only used for LWT and can only be set to either 
SERIAL or LOCAL_SERIAL. However, the Steps for Upgrading 
Paxos<https://github.com/apache/cassandra/blob/trunk/NEWS.txt#L532> and 
CEP-14<https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-14%3A+Paxos+Improvements>
 mention that ANY or LOCAL_QUOROM can be used as the commit consistency level 
for LWT after upgrading Paxos to v2. Therefore, I have a question about how to 
correctly set the commit consistency level to ANY or LOCAL_QUORUM for LWT. 
Namely, which consistency level should I set, the normal Consistency Level or 
Serial Consistency Level?

Any help would be really appreciated.

Thanks,
Justin

Re: Question about commit consistency level for Lightweight Transactions in Paxos v2

2024-03-11 Thread Laxmikant Upadhyay

You need to set both in case of lwt. your regular non -serial consistency
level will only applied during commit phase of lwt.


On Wed, 6 Mar, 2024, 03:30 Weng, Justin via user, 
wrote:

> Hi Cassandra Community,
>
>
>
> I’ve been investigating Cassandra Paxos v2 (as implemented in CEP-14
> )
> which improves the performance of lightweight transaction (LWT). But I’ve
> got a question about setting the commit consistency level for LWT after
> upgrading Paxos.
>
>
>
> In cqlsh
> ,
> gocql  and Python
> driver
> ,
> there are two settings for consistency levels: normal Consistency Level and
> Serial Consistency Level. As mentioned in the cqlsh documentation
> ,
> Serial Consistency Level is only used for LWT and can only be set to either
> SERIAL or LOCAL_SERIAL. However, the Steps for Upgrading Paxos
>  and CEP-14
> 
> mention that ANY or LOCAL_QUOROM can be used as the commit consistency
> level for LWT after upgrading Paxos to v2. Therefore, I have a question
> about how to correctly set the commit consistency level to ANY or
> LOCAL_QUORUM for LWT. Namely, which consistency level should I set, the
> normal Consistency Level or Serial Consistency Level?
>
>
>
> Any help would be really appreciated.
>
>
>
> Thanks,
>
> Justin
>

Re: Best Practices for Managing Concurrent Client Connections in Cassandra

2024-02-29 Thread Andrew Weaver

We've used these settings in production with no issues.

What has been more valuable to us though is limiting the rate of client
connections via iptables. Often times users configure an aggressive
reconnection policy that floods the cluster with connections in certain
circumstances like a node restart or a network glitch.

Are you sure it's the number of connections causing problems or is it the
rate of connections being made? I've seen nodes handle over 40k
connections.

On Thu, Feb 29, 2024, 4:51 AM Naman kaushik  wrote:

> Hello Cassandra Community,
>
> We've been experiencing occasional spikes in the number of client
> connections to our Cassandra cluster, particularly during high-volume API
> request periods. We're using persistent connections, and we've noticed that
> the number of connections can increase significantly during these spikes.
>
> We're considering using the following Cassandra parameters to manage
> concurrent client connections:
>
> *native_transport_max_concurrent_connections*: This parameter sets the
> maximum number of concurrent client connections allowed by the native
> transport protocol. Currently, it's set to -1, indicating no limit.
>
> *native_transport_max_concurrent_connections_per_ip*: This parameter sets
> the maximum number of concurrent client connections allowed per source IP
> address. Like the previous parameter, it's also set to -1.
>
> We're thinking of using these parameters to limit the maximum number of
> connections from a single IP address, especially to prevent overwhelming
> the database during spikes in API requests that should be handled by our
> SOA team exclusively.
>
> Are these parameters suitable for production use, and would implementing
> restrictions on concurrent connections per IP be considered a good practice
> in managing Cassandra clusters?
>
> Any insights or recommendations would be greatly appreciated.
>
> Thank you!
>
> Naman
>

Re: Best Practices for Managing Concurrent Client Connections in Cassandra

2024-02-29 Thread Bowen Song via user

They are suitable for production use for protecting your Cassandra 
server, not the clients. The clients likely will experience an error 
when the limit is reached, and it needs to handle that error appropriately.


What you really want to do probably are:

1. change the client's behaviour, limit the number of servers it 
connects to concurrently. The client can close connections not in use, 
and/or only connect to a subset of servers (note: affects token-aware 
routing).


2. after made the above change, if the number of connections is still an 
issue, horizontally scale up your Cassandra cluster to handle the peak 
number of connections. More nodes means less connections to each node.



On 29/02/2024 10:50, Naman kaushik wrote:


Hello Cassandra Community,

We've been experiencing occasional spikes in the number of client 
connections to our Cassandra cluster, particularly during high-volume 
API request periods. We're using persistent connections, and we've 
noticed that the number of connections can increase significantly 
during these spikes.


We're considering using the following Cassandra parameters to manage 
concurrent client connections:


*native_transport_max_concurrent_connections*: This parameter sets the 
maximum number of concurrent client connections allowed by the native 
transport protocol. Currently, it's set to -1, indicating no limit.


*native_transport_max_concurrent_connections_per_ip*: This parameter 
sets the maximum number of concurrent client connections allowed per 
source IP address. Like the previous parameter, it's also set to -1.


We're thinking of using these parameters to limit the maximum number 
of connections from a single IP address, especially to prevent 
overwhelming the database during spikes in API requests that should be 
handled by our SOA team exclusively.


Are these parameters suitable for production use, and would 
implementing restrictions on concurrent connections per IP be 
considered a good practice in managing Cassandra clusters?


Any insights or recommendations would be greatly appreciated.

Thank you!

Naman

RE: Check out new features in K8ssandra and Mission Control

2024-02-28 Thread Durity, Sean R via user

The k8ssandra requirement is a major blocker.

Sean R. Durity

INTERNAL USE
From: Christopher Bradford 
Sent: Tuesday, February 27, 2024 9:49 PM
To: user@cassandra.apache.org
Cc: Christopher Bradford 
Subject: [EXTERNAL] Re: Check out new features in K8ssandra and Mission Control

Hey Jon, * What aspects of Mission Control are dependent on using K8ssandra? 
Mission Control bundles in K8ssandra for the core automation workflows 
(lifecycle management, cluster operations, medusa &. reaper). In fact we 
include the K8ssandraSpec

Hey Jon,

* What aspects of Mission Control are dependent on using K8ssandra?

Mission Control bundles in K8ssandra for the core automation workflows 
(lifecycle management, cluster operations, medusa &. reaper). In fact we 
include the K8ssandraSpec in the top-level MissionControlCluster resource 
verbatim.

 * Can Mission Control work without K8ssandra?

Not at this time, K8ssandra powers a significant portion of the C* side of the 
stack. Mission Control provides additional functionality (web interface, 
certificate coordination, observability stack, etc) and applies some 
conventions to how K8ssandra objects are created / templated out, but the 
actually K8ssandra operator present in MC is the same one available via the 
Helm charts.

* Is mission control open source?

Not at this time. While the majority of the Kubernetes operators are open 
source as part of K8ssandra, there are some pieces which are closed source. I 
expect some of the components may move from closed source into K8ssandra over 
time.

* I'm not familiar with Vector - does it require an agent?

Vector 
[vector.dev]<https://urldefense.com/v3/__https:/vector.dev/__;!!M-nmYVHPHQ!L0M2-GDjSTuPzpD3TvfxiLgm2I5pLnFGmW5BhdhaYQBI6RlxK6e6ZMM2khnp8YUhfnqy2wYzjPX-UYAG9q8HOgVBPg$>
 is a pretty neat project. We run a few of their components as part of the 
stack. There is a DaemonSet which runs on each worker to collect host level 
metrics and scrape logs being emitted by containers, a sidecar for collecting 
logs from the C* container, and an aggregator which performs some filtering and 
transformation before pushing to an object store.

* Is Reaper deployed separately or integrated in?

Reaper is deployed as part of the cluster creation workflow. It is spun up and 
configured to connect to the cluster automatically.

~Chris

Christopher Bradford

On Tue, Feb 27, 2024 at 6:55 PM Jon Haddad 
mailto:j...@jonhaddad.com>> wrote:
Hey Chris - this looks pretty interesting!  It looks like there's a lot of 
functionality in here.

* What aspects of Mission Control are dependent on using K8ssandra?
* Can Mission Control work without K8ssandra?
* Is mission control open source?
* I'm not familiar with Vector - does it require an agent?
* Is Reaper deployed separately or integrated in?

Thanks!  Looking forward to trying this out.
Jon

On Tue, Feb 27, 2024 at 7:07 AM Christopher Bradford 
mailto:bradfor...@gmail.com>> wrote:

Hey C* folks,

I'm excited to share that the DataStax team has just released Mission Control 
[datastax.com]<https://urldefense.com/v3/__https:/datastax.com/products/mission-control__;!!M-nmYVHPHQ!L0M2-GDjSTuPzpD3TvfxiLgm2I5pLnFGmW5BhdhaYQBI6RlxK6e6ZMM2khnp8YUhfnqy2wYzjPX-UYAG9q-QlbHB9g$>,
 a new operations platform for running Apache Cassandra and DataStax 
Enterprise. Built around the open source core of K8ssandra 
[k8ssandra.io]<https://urldefense.com/v3/__https:/k8ssandra.io/__;!!M-nmYVHPHQ!L0M2-GDjSTuPzpD3TvfxiLgm2I5pLnFGmW5BhdhaYQBI6RlxK6e6ZMM2khnp8YUhfnqy2wYzjPX-UYAG9q9NjbtYtg$>
 we've been hard at work expanding multi-region capabilities. If you haven't 
seen some of the new features coming in here are some highlights:

  *   Management API support in Reaper - no more JMX credentials, YAY
  *   Additional support for TLS across the stack- including operator to node, 
Reaper to management API, etc
  *   Updated metrics pipeline - removal of collectd from nodes, Vector for 
monitoring log files (goodbye tail -f)
  *   Deterministic node selection for cluster operations
  *   Top-level management tasks in the control plane (no more forced 
connections to data planes to trigger a restart)

On top of this Mission Control offers:

  *   A single web-interface to monitor and manage your clusters wherever 
they're deployed
  *   Automatic management of internode and operator to node certificates - 
this includes integration with third party CAs and rotation of all 
certificates, keys, and various Java stores
  *   Centralized metrics and logs aggregation, querying and storage with the 
capability to split the pipeline allowing for exporting of streams to other 
observability tools within your environment
  *   Per-node configuration (this is an edge case, but still something we 
wanted to make possible)

While building our Mission Control, K8ssandra has seen a number of releases 
with quite a few contributions from the community. From Helm chart updates to 
oper

Re: stress testing & lab provisioning tools

2024-02-28 Thread Alexander DEJANOVSKI

Hey Jon,

It's awesome to see that you're reviving both these projects!

I was eager to get my hands on an updated version of tlp-cluster with up to
date AMIs 
tlp-stress is by far the best Cassandra stress tool I've worked with, and I
recommend everyone to test easy-cass-stress and build additional workload
types.

Looking forward to testing these new forks.

Alex

Le mar. 27 févr. 2024, 02:00, Jon Haddad  a écrit :

> Hey everyone,
>
> Over the last several months I've put a lot of work into 2 projects I
> started back at The Last Pickle, for stress testing Cassandra and for
> building labs in AWS.  You may know them as tlp-stress and tlp-cluster.
>
> Since I haven't worked at TLP in almost half a decade, and am the primary
> / sole person investing time, I've rebranded them to easy-cass-stress and
> easy-cass-lab.  There's been several major improvements in both projects
> and I invite you to take a look at both of them.
>
> easy-cass-stress
>
> Many of you are familiar with tlp-stress.  easy-cass-stress is a fork /
> rebrand of the project that uses almost the same familiar interface as
> tlp-stress, but with some improvements.  easy-cass-stress is even easier to
> use, requiring less guessing to the parameters to help you figure out your
> performance profile.  Instead of providing a -c flag (for in-flight
> concurrency) you can now simply provide your max read and write latencies
> and it'll figure out the throughput it can get on its own or used fixed
> rate scheduling like many other benchmarking tools have.  The adaptive
> scheduling is based on a Netflix Tech Blog post, but slightly modified to
> be sensitive to latency metrics instead of just errors.   You can read more
> about some of my changes here:
> https://rustyrazorblade.com/post/2023/2023-10-31-tlp-stress-adaptive-scheduler/
>
> GH repo: https://github.com/rustyrazorblade/easy-cass-stress
>
> easy-cass-lab
>
> This is a powerful tool that makes it much easier to spin up lab
> environments using any released version of Cassandra, with functionality
> coming to test custom branches and trunk.  It's a departure from the old
> tlp-cluster that installed and configured everything at runtime.  By
> creating a universal, multi-version AMI complete with all my favorite
> debugging tools, it's now possible to create a lab environment in under 2
> minutes in AWS.  The image includes easy-cass-stress making it
> straightforward to spin up clusters to test existing releases, and soon
> custom builds and trunk.  Fellow committer Jordan West has been working on
> this with me and we've made a ton of progress over the last several weeks.
>  For a demo check out my working session live stream last week where I
> fixed a few issues and discussed the potential and development path for the
> tool: https://youtu.be/dPtsBut7_MM
>
> GH repo: https://github.com/rustyrazorblade/easy-cass-lab
>
> I hope you find these tools as useful as I have.  I am aware of many
> extremely large Cassandra teams using tlp-stress with their 1K+ node
> environments, and hope the additional functionality in easy-cass-stress
> makes it easier for folks to start benchmarking C*, possibly in conjunction
> with easy-cass-lab.
>
> Looking forward to hearing your feedback,
> Jon
>

Re: Check out new features in K8ssandra and Mission Control

2024-02-27 Thread Christopher Bradford

Hey Jon,

* What aspects of Mission Control are dependent on using K8ssandra?
>

Mission Control bundles in K8ssandra for the core automation workflows
(lifecycle management, cluster operations, medusa &. reaper). In fact we
include the K8ssandraSpec in the top-level MissionControlCluster resource
verbatim.

 * Can Mission Control work without K8ssandra?

Not at this time, K8ssandra powers a significant portion of the C* side of
the stack. Mission Control provides additional functionality (web
interface, certificate coordination, observability stack, etc) and
applies some conventions to how K8ssandra objects are created / templated
out, but the actually K8ssandra operator present in MC is the same one
available via the Helm charts.

* Is mission control open source?
>

Not at this time. While the majority of the Kubernetes operators are open
source as part of K8ssandra, there are some pieces which are closed source.
I expect some of the components may move from closed source into K8ssandra
over time.

* I'm not familiar with Vector - does it require an agent?

Vector  is a pretty neat project. We run a few of
their components as part of the stack. There is a DaemonSet which runs on
each worker to collect host level metrics and scrape logs being emitted by
containers, a sidecar for collecting logs from the C* container, and an
aggregator which performs some filtering and transformation before pushing
to an object store.

* Is Reaper deployed separately or integrated in?
>

Reaper is deployed as part of the cluster creation workflow. It is spun up
and configured to connect to the cluster automatically.

~Chris

Christopher Bradford

On Tue, Feb 27, 2024 at 6:55 PM Jon Haddad  wrote:

> Hey Chris - this looks pretty interesting!  It looks like there's a lot of
> functionality in here.
>
> * What aspects of Mission Control are dependent on using K8ssandra?
> * Can Mission Control work without K8ssandra?
> * Is mission control open source?
> * I'm not familiar with Vector - does it require an agent?
> * Is Reaper deployed separately or integrated in?
>
> Thanks!  Looking forward to trying this out.
> Jon
>
>
> On Tue, Feb 27, 2024 at 7:07 AM Christopher Bradford 
> wrote:
>
>> Hey C* folks,
>>
>> I'm excited to share that the DataStax team has just released Mission
>> Control , a new
>> operations platform for running Apache Cassandra and DataStax Enterprise.
>> Built around the open source core of K8ssandra 
>> we've been hard at work expanding multi-region capabilities. If you haven't
>> seen some of the new features coming in here are some highlights:
>>
>>
>>-
>>
>>Management API support in Reaper - no more JMX credentials, YAY
>>-
>>
>>Additional support for TLS across the stack- including operator to
>>node, Reaper to management API, etc
>>-
>>
>>Updated metrics pipeline - removal of collectd from nodes, Vector for
>>monitoring log files (goodbye tail -f)
>>-
>>
>>Deterministic node selection for cluster operations
>>-
>>
>>Top-level management tasks in the control plane (no more forced
>>connections to data planes to trigger a restart)
>>
>>
>> On top of this Mission Control offers:
>>
>>
>>-
>>
>>A single web-interface to monitor and manage your clusters wherever
>>they're deployed
>>-
>>
>>Automatic management of internode and operator to node certificates -
>>this includes integration with third party CAs and rotation of all
>>certificates, keys, and various Java stores
>>-
>>
>>Centralized metrics and logs aggregation, querying and storage with
>>the capability to split the pipeline allowing for exporting of streams to
>>other observability tools within your environment
>>-
>>
>>Per-node configuration (this is an edge case, but still something we
>>wanted to make possible)
>>
>>
>> While building our Mission Control, K8ssandra has seen a number of
>> releases with quite a few contributions from the community. From Helm chart
>> updates to operator tweaks we want to send out a huge THANK YOU to everyone
>> who has filed issues, opened pull requests, and helped us test bugfixes and
>> new functionality.
>>
>> If you've been sleeping on K8ssandra, now is a good time to check it out
>> . It has all of the pieces needed to
>> run Cassandra in production. Looking for something out of the box instead
>> of putting the pieces together yourself, take Mission Control for a spin
>> and sign up for the trial
>> . I'm happy to
>> answer any K8ssandra or Mission Control questions you may have here or on
>> our Discord .
>>
>> Cheers,
>>
>> ~Chris
>>
>> Christopher Bradford
>>
>>

Re: Check out new features in K8ssandra and Mission Control

2024-02-27 Thread Jon Haddad

Hey Chris - this looks pretty interesting!  It looks like there's a lot of
functionality in here.

* What aspects of Mission Control are dependent on using K8ssandra?
* Can Mission Control work without K8ssandra?
* Is mission control open source?
* I'm not familiar with Vector - does it require an agent?
* Is Reaper deployed separately or integrated in?

Thanks!  Looking forward to trying this out.
Jon


On Tue, Feb 27, 2024 at 7:07 AM Christopher Bradford 
wrote:

> Hey C* folks,
>
> I'm excited to share that the DataStax team has just released Mission
> Control , a new operations
> platform for running Apache Cassandra and DataStax Enterprise. Built around
> the open source core of K8ssandra  we've been hard
> at work expanding multi-region capabilities. If you haven't seen some of
> the new features coming in here are some highlights:
>
>
>-
>
>Management API support in Reaper - no more JMX credentials, YAY
>-
>
>Additional support for TLS across the stack- including operator to
>node, Reaper to management API, etc
>-
>
>Updated metrics pipeline - removal of collectd from nodes, Vector for
>monitoring log files (goodbye tail -f)
>-
>
>Deterministic node selection for cluster operations
>-
>
>Top-level management tasks in the control plane (no more forced
>connections to data planes to trigger a restart)
>
>
> On top of this Mission Control offers:
>
>
>-
>
>A single web-interface to monitor and manage your clusters wherever
>they're deployed
>-
>
>Automatic management of internode and operator to node certificates -
>this includes integration with third party CAs and rotation of all
>certificates, keys, and various Java stores
>-
>
>Centralized metrics and logs aggregation, querying and storage with
>the capability to split the pipeline allowing for exporting of streams to
>other observability tools within your environment
>-
>
>Per-node configuration (this is an edge case, but still something we
>wanted to make possible)
>
>
> While building our Mission Control, K8ssandra has seen a number of
> releases with quite a few contributions from the community. From Helm chart
> updates to operator tweaks we want to send out a huge THANK YOU to everyone
> who has filed issues, opened pull requests, and helped us test bugfixes and
> new functionality.
>
> If you've been sleeping on K8ssandra, now is a good time to check it out
> . It has all of the pieces needed to
> run Cassandra in production. Looking for something out of the box instead
> of putting the pieces together yourself, take Mission Control for a spin
> and sign up for the trial
> . I'm happy to
> answer any K8ssandra or Mission Control questions you may have here or on
> our Discord .
>
> Cheers,
>
> ~Chris
>
> Christopher Bradford
>
>

Re: Question Regarding Cassandra-19336

2024-02-25 Thread manish khandelwal

It looks a crtirical bug for setup with multi DC using high number of
vnodes and running full repair with -PR option, since number of parallel
repair sessions  can be as high as number of vnodes. Thus it can fill up
memory causing pom or direct buffer memory oom. It should get prioritized
for release.



On Thu, Feb 22, 2024, 12:31 C. Scott Andreas  wrote:

> The “Since Version” for the ticket is set to 3.0.19, presumably based on
> C-14096 as the predecessor for this ticket.
>
> C-14096 was merged up into 3.11.x in the 3.11.5 release, so 3.11.5 would
> be the equivalent “since version” for that release series. The patch
> addressing this ticket is included in 4.0.12+, 4.1.4+, and 5.0-beta2+.
>
> If the question behind the question is an OOM related to repair, keep in
> mind that this ticket’s title is a bit non-specific and will not capture
> all sources of memory allocated during repair or all causes of OOMs.
>
> If you’re running 3.11.x, most members of the Cassandra community would
> recommend upgrading to the latest 4.0.x release at a minimum to take
> advantage of years of stability and performance improvements in the project.
>
> - Scott
>
> On Feb 21, 2024, at 10:42 PM, ranju goel  wrote:
>
> 
> Hi All,
>
> https://issues.apache.org/jira/browse/CASSANDRA-19336
> Does the same issue mentioned in the above JIRA exists for version 3.11.x
>
> Regards
> Ranju
>
>

Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)

2024-02-23 Thread Pierre Fersing

Hi,

Thanks for your detailed answers. I understand the reason why using low
priority compaction may not be a great idea in the general case (the example
with too high CPU for reading).

I’ll give a try with the compaction throughput which I total forgot that this
option exists. It may fix the issue I see after my upgrade. I’ll see.
If it’s not fixing my trouble, I’ll try to confirm that my issue is indeed due
to compaction putting too much pressure on CPU because going further as it will
probably require more works that initial though and likely being a configurable
option.

Regards,
Pierre Fersing

De : Dmitry Konstantinov
Date : jeudi, 22 février 2024 à 20:39
À : user@cassandra.apache.org
Objet : Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)
Hi all,

I was not participating in the changes but I analyzed the question some time
ago from another side.
There were also changes related to -XX:ThreadPriorityPolicy JVM option. When
you set a thread priority for a Java thread it does not mean it is always
propagated as a native OS thread priority. To propagate the priority you should
use -XX:ThreadPriorityPolicy and -XX:JavaPriorityN_To_OSPriority JVM options,
but there is an issue with them because JVM wants to be executed under root to
set -XX:ThreadPriorityPolicy=1 which enables the priorities usage. A hack was
invented long time ago to workaround it by setting -XX:ThreadPriorityPolicy=42
value (any value not equal to 0 or 1) and bypass the not so needed and annoying
grants validation logic (see
http://tech.stolsvik.com/2010/01/linux-java-thread-priorities-workaround.html
for more details about).
It worked for Java 8 but then there was a change in Java 9 about adding extra
validation for JVM option values (JEP 245: Validate JVM Command-Line Flag
Arguments - https://bugs.openjdk.org/browse/JDK-8059557) and the hack stopped
to work and started to cause JVM failure with a validation error. As a reaction
to it - the flag was removed from Cassandra JVM configuration files in
https://issues.apache.org/jira/browse/CASSANDRA-13107. After it the lower
priority value for compaction threads have not had any actual effect.
The interesting story is that the JVM logic has been changed to support the
ability to set -XX:ThreadPriorityPolicy=1 for non-root users in Java 13
(https://bugs.openjdk.org/browse/JDK-8215962) and the change was backported to
Java 11 as well (https://bugs.openjdk.org/browse/JDK-8217494).
So, from this point of view I think it would be nice to return back the ability
to set thread priority for compaction threads. At the same time I would not
expect too much improvement by enabling it.

P.S. There was also an idea about using ionice
(https://issues.apache.org/jira/browse/CASSANDRA-9946) but the current Linux IO
schedulers do not take that into account anymore. It looks like the only
scheduler that supported ionice was CFQ
(https://issues.apache.org/jira/browse/CASSANDRA-9946?focusedCommentId=14648616=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14648616)
and it was deprecated and removed since Linux kernel 5.x
(https://github.com/torvalds/linux/commit/f382fb0bcef4c37dc049e9f6963e3baf204d815c).

Regards,
Dmitry

On Thu, 22 Feb 2024 at 15:30, Bowen Song via user
mailto:user@cassandra.apache.org>> wrote:

Hi Pierre,

Is there anything stopping you from using the
compaction_throughput<https://github.com/apache/cassandra/blob/f9e033f519c14596da4dc954875756a69aea4e78/conf/cassandra.yaml#L989>
option in the cassandra.yaml file to manage the performance impact of
compaction operations?

With thread priority, there's a failure scenario on busy nodes when the read
operations uses too much CPU. If the compaction thread has lower priority, it
does not get enough CPU time to run, and SSTable files will build up, causing
read to become slower and more expensive, which in turn result in compaction
gets even less CPU time. At the end, one of the following three will happen:

* the node becomes too slow and most queries time out
* the Java process crashes due to too many open files or OOM because JVM GC
can't keep up
* the filesystem run out of free space or inodes

However, I'm unsure whether the compaction thread priority was intentionally
removed from 4.1.0. Someone familiar with this matter may be able to answer
that.

Cheers,
Bowen

On 22/02/2024 13:26, Pierre Fersing wrote:
Hello all,

I've recently upgraded to Cassandra 4.1 and see a change in compaction behavior
that seems unwanted:

* With Cassandra 3.11 compaction was run by thread in low priority and thus
using CPU nice (visible using top) (I believe Cassandra 4.0 also had this
behavior)

* With Cassandra 4.1, compactions are no longer run as low priority thread
(compaction now use "normal" CPU).

This means that when the server had limited CPU, Cassandra compaction now
compete for the CPU with other process (probab

Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)

2024-02-22 Thread Dmitry Konstantinov

ice (visible using top) (I believe Cassandra 4.0 also had
>> this behavior)
>>
>> * With Cassandra 4.1, compactions are no longer run as low priority
>> thread (compaction now use "normal" CPU).
>>
>> This means that when the server had limited CPU, Cassandra compaction now
>> compete for the CPU with other process (probably including Cassandra
>> itself) that need CPU. When it was using CPU nice, the compaction only
>> competed for CPU with other lower priority process which was great as it
>> leaves CPU available for processes that need to kept small response time
>> (like an API used by human).
>>
>> Is it wanted to lose this feature in Cassandra 4.1 or was it just a
>> forget during re-write of compaction executor ? Should I open a bug to
>> re-introduce this feature in Cassandra ?
>>
>>
>> I've done few searches, and:
>>
>> * I believe compaction used CPU nice because the compactor executor was
>> created with minimal priority:
>> https://github.com/apache/cassandra/blob/cassandra-3.11.16/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L1906
>>
>> * I think it was dropped by commit
>> https://github.com/apache/cassandra/commit/be1f050bc8c0cd695a42952e3fc84625ad48d83a
>>
>> * It looks doable to set the thread priority with new executor, I think
>> adding ".withThreadPriority(Thread.MIN_PRIORITY)" when using
>> executorFactory in
>> https://github.com/apache/cassandra/blob/77a3e0e818df3cce45a974ecc977ad61bdcace47/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L2028
>> should do it.
>>
>>
>> Did I miss a reason to no longer use low priority threads for compaction
>> ? Should I open a bug for re-adding this feature / submit a PR ?
>>
>> Regards,
>>
>> Pierre Fersing
>>
>>
>
> --
> Dmitry Konstantinov
>
>

-- 
Dmitry Konstantinov

Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)

2024-02-22 Thread Bowen Song via user

rocess which was great as it leaves CPU available for
processes that need to kept small response time (like an API used
by human).

Is it wanted to lose this feature in Cassandra 4.1 or was it just
a forget during re-write of compaction executor ? Should I open a
bug to re-introduce this feature in Cassandra ?


I've done few searches, and:

* I believe compaction used CPU nice because the compactor
executor was created with minimal priority:

https://github.com/apache/cassandra/blob/cassandra-3.11.16/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L1906

<https://github.com/apache/cassandra/blob/cassandra-3.11.16/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L1906>

* I think it was dropped by commit

https://github.com/apache/cassandra/commit/be1f050bc8c0cd695a42952e3fc84625ad48d83a

<https://github.com/apache/cassandra/commit/be1f050bc8c0cd695a42952e3fc84625ad48d83a>

* It looks doable to set the thread priority with new executor, I
think adding ".withThreadPriority(Thread.MIN_PRIORITY)" when
using executorFactory in

https://github.com/apache/cassandra/blob/77a3e0e818df3cce45a974ecc977ad61bdcace47/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L2028

<https://github.com/apache/cassandra/blob/77a3e0e818df3cce45a974ecc977ad61bdcace47/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L2028>should
do it.


Did I miss a reason to no longer use low priority threads for
compaction ? Should I open a bug for re-adding this feature /
submit a PR ?

Regards,

Pierre Fersing




--
Dmitry Konstantinov

Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)

2024-02-22 Thread Dmitry Konstantinov

Hi all,

I was not participating in the changes but I analyzed the question some
time ago from another side.
There were also changes related to -XX:ThreadPriorityPolicy JVM option.
When you set a thread priority for a Java thread it does not mean it is
always propagated as a native OS thread priority. To propagate the priority
you should use  -XX:ThreadPriorityPolicy and -XX:JavaPriorityN_To_OSPriority
 JVM options, but there is an issue with them because JVM wants to be
executed under root to set -XX:ThreadPriorityPolicy=1 which enables the
priorities usage. A hack was invented long time ago to workaround it by
setting -XX:ThreadPriorityPolicy=42 value (any value not equal to 0 or 1)
and bypass the not so needed and annoying grants validation logic (see
http://tech.stolsvik.com/2010/01/linux-java-thread-priorities-workaround.html
for more details about).
It worked for Java 8 but then there was a change in Java 9 about adding
extra validation for JVM option values (JEP 245: Validate JVM Command-Line
Flag Arguments  - https://bugs.openjdk.org/browse/JDK-8059557) and the hack
stopped to work and started to cause JVM failure with a validation error.
As a reaction to it - the flag was removed from Cassandra JVM configuration
files in https://issues.apache.org/jira/browse/CASSANDRA-13107. After it
the lower priority value for compaction threads have not had any actual
effect.
The interesting story is that the JVM logic has been changed to support the
ability to set -XX:ThreadPriorityPolicy=1 for non-root users in Java 13 (
https://bugs.openjdk.org/browse/JDK-8215962) and the change was backported
to Java 11 as well (https://bugs.openjdk.org/browse/JDK-8217494).
So, from this point of view I think it would be nice to return back the
ability to set thread priority for compaction threads. At the same time I
would not expect too much improvement by enabling it.

P.S. There was also an idea about using ionice (
https://issues.apache.org/jira/browse/CASSANDRA-9946) but the current Linux
IO schedulers do not take that into account anymore. It looks like the only
scheduler that supported ionice was CFQ (
https://issues.apache.org/jira/browse/CASSANDRA-9946?focusedCommentId=14648616=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14648616)
and it was deprecated and removed since Linux kernel 5.x (
https://github.com/torvalds/linux/commit/f382fb0bcef4c37dc049e9f6963e3baf204d815c
).

Regards,
Dmitry

On Thu, 22 Feb 2024 at 15:30, Bowen Song via user 
wrote:

> Hi Pierre,
>
> Is there anything stopping you from using the compaction_throughput
> <https://github.com/apache/cassandra/blob/f9e033f519c14596da4dc954875756a69aea4e78/conf/cassandra.yaml#L989>
> option in the cassandra.yaml file to manage the performance impact of
> compaction operations?
>
> With thread priority, there's a failure scenario on busy nodes when the
> read operations uses too much CPU. If the compaction thread has lower
> priority, it does not get enough CPU time to run, and SSTable files will
> build up, causing read to become slower and more expensive, which in turn
> result in compaction gets even less CPU time. At the end, one of the
> following three will happen:
>
>- the node becomes too slow and most queries time out
>- the Java process crashes due to too many open files or OOM because
>JVM GC can't keep up
>- the filesystem run out of free space or inodes
>
> However, I'm unsure whether the compaction thread priority was
> intentionally removed from 4.1.0. Someone familiar with this matter may be
> able to answer that.
>
> Cheers,
> Bowen
>
>
> On 22/02/2024 13:26, Pierre Fersing wrote:
>
> Hello all,
>
> I've recently upgraded to Cassandra 4.1 and see a change in compaction
> behavior that seems unwanted:
>
> * With Cassandra 3.11 compaction was run by thread in low priority and
> thus using CPU nice (visible using top) (I believe Cassandra 4.0 also had
> this behavior)
>
> * With Cassandra 4.1, compactions are no longer run as low priority thread
> (compaction now use "normal" CPU).
>
> This means that when the server had limited CPU, Cassandra compaction now
> compete for the CPU with other process (probably including Cassandra
> itself) that need CPU. When it was using CPU nice, the compaction only
> competed for CPU with other lower priority process which was great as it
> leaves CPU available for processes that need to kept small response time
> (like an API used by human).
>
> Is it wanted to lose this feature in Cassandra 4.1 or was it just a forget
> during re-write of compaction executor ? Should I open a bug to
> re-introduce this feature in Cassandra ?
>
>
> I've done few searches, and:
>
> * I believe compaction used CPU nice because the compactor executor was
> created with minimal priority:
> h

Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)

2024-02-22 Thread Bowen Song via user


Hi Pierre,

Is there anything stopping you from using the compaction_throughput 
<https://github.com/apache/cassandra/blob/f9e033f519c14596da4dc954875756a69aea4e78/conf/cassandra.yaml#L989> 
option in the cassandra.yaml file to manage the performance impact of 
compaction operations?


With thread priority, there's a failure scenario on busy nodes when the 
read operations uses too much CPU. If the compaction thread has lower 
priority, it does not get enough CPU time to run, and SSTable files will 
build up, causing read to become slower and more expensive, which in 
turn result in compaction gets even less CPU time. At the end, one of 
the following three will happen:


 * the node becomes too slow and most queries time out
 * the Java process crashes due to too many open files or OOM because
   JVM GC can't keep up
 * the filesystem run out of free space or inodes

However, I'm unsure whether the compaction thread priority was 
intentionally removed from 4.1.0. Someone familiar with this matter may 
be able to answer that.


Cheers,
Bowen


On 22/02/2024 13:26, Pierre Fersing wrote:


Hello all,

I've recently upgraded to Cassandra 4.1 and see a change in compaction 
behavior that seems unwanted:


* With Cassandra 3.11 compaction was run by thread in low priority and 
thus using CPU nice (visible using top) (I believe Cassandra 4.0 also 
had this behavior)


* With Cassandra 4.1, compactions are no longer run as low priority 
thread (compaction now use "normal" CPU).


This means that when the server had limited CPU, Cassandra compaction 
now compete for the CPU with other process (probably including 
Cassandra itself) that need CPU. When it was using CPU nice, the 
compaction only competed for CPU with other lower priority process 
which was great as it leaves CPU available for processes that need to 
kept small response time (like an API used by human).


Is it wanted to lose this feature in Cassandra 4.1 or was it just a 
forget during re-write of compaction executor ? Should I open a bug to 
re-introduce this feature in Cassandra ?



I've done few searches, and:

* I believe compaction used CPU nice because the compactor executor 
was created with minimal priority: 
https://github.com/apache/cassandra/blob/cassandra-3.11.16/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L1906 
<https://github.com/apache/cassandra/blob/cassandra-3.11.16/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L1906>


* I think it was dropped by commit 
https://github.com/apache/cassandra/commit/be1f050bc8c0cd695a42952e3fc84625ad48d83a 
<https://github.com/apache/cassandra/commit/be1f050bc8c0cd695a42952e3fc84625ad48d83a>


* It looks doable to set the thread priority with new executor, I 
think adding ".withThreadPriority(Thread.MIN_PRIORITY)" when using 
executorFactory in 
https://github.com/apache/cassandra/blob/77a3e0e818df3cce45a974ecc977ad61bdcace47/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L2028 
<https://github.com/apache/cassandra/blob/77a3e0e818df3cce45a974ecc977ad61bdcace47/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L2028>should 
do it.



Did I miss a reason to no longer use low priority threads for 
compaction ? Should I open a bug for re-adding this feature / submit a 
PR ?


Regards,

Pierre Fersing

Re: Question Regarding Cassandra-19336

2024-02-21 Thread C. Scott Andreas

The “Since Version” for the ticket is set to 3.0.19, presumably based on 
C-14096 as the predecessor for this ticket.

C-14096 was merged up into 3.11.x in the 3.11.5 release, so 3.11.5 would be the 
equivalent “since version” for that release series. The patch addressing this 
ticket is included in 4.0.12+, 4.1.4+, and 5.0-beta2+.

If the question behind the question is an OOM related to repair, keep in mind 
that this ticket’s title is a bit non-specific and will not capture all sources 
of memory allocated during repair or all causes of OOMs.

If you’re running 3.11.x, most members of the Cassandra community would 
recommend upgrading to the latest 4.0.x release at a minimum to take advantage 
of years of stability and performance improvements in the project.

- Scott

> On Feb 21, 2024, at 10:42 PM, ranju goel  wrote:
> 
> 
> Hi All,
> 
> https://issues.apache.org/jira/browse/CASSANDRA-19336
> Does the same issue mentioned in the above JIRA exists for version 3.11.x
> 
> Regards
> Ranju

Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-19 Thread Gowtham S

Thanks for your valuable reply, will check.
Thanks and regards,
Gowtham S


On Mon, 19 Feb 2024 at 15:46, Bowen Song via user 
wrote:

> You can have a read at
> https://www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets
>
> Your table schema does not include the most important piece of information
> - the partition keys (and clustering keys, if any). Keep in mind that you
> can only efficiently query Cassandra by the exact partition key or the
> token of a partition key, otherwise you will have to rely on MV or
> secondary index, or worse, scan the entire table (all the nodes) to find
> your data.
>
> A Cassandra schema should look like this:
> CREATE TABLE xyz (
>   a text,
>   b text,
>   c timeuuid,
>   d int,
>   e text,
>   PRIMARY KEY ((a, b), c, d)
> );
>
> The line "PRIMARY KEY" contains arguably the most important piece of
> information of the table schema.
>
>
> On 19/02/2024 06:52, Gowtham S wrote:
>
> Hi Bowen
>
> which is a well documented anti-pattern.
>>
> Can you please explain more on this, I'm not aware of it. It will be
> helpful to make decisions.
> Please find the below table schema
>
> *Table schema*
> TopicName - text
> Partition - int
> MessageUUID - text
> Actual data - text
> OccurredTime - Timestamp
> Status - boolean
>
> We are planning to read the table with the topic name and the status is
> not true. And produce those to the respective topic when Kafka is live.
>
> Thanks and regards,
> Gowtham S
>
>
> On Sat, 17 Feb 2024 at 18:10, Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> Hi Gowtham,
>>
>> On the face of it, it sounds like you are planning to use Cassandra for a
>> queue-like application, which is a well documented anti-pattern. If that's
>> not the case, can you please show the table schema and some example queries?
>>
>> Cheers,
>> Bowen
>> On 17/02/2024 08:44, Gowtham S wrote:
>>
>> Dear Cassandra Community,
>>
>> I am reaching out to seek your valuable feedback and insights on a
>> proposed solution we are considering for managing Kafka outages using
>> Cassandra.
>>
>> At our organization, we heavily rely on Kafka for real-time data
>> processing and messaging. However, like any technology, Kafka is
>> susceptible to occasional outages which can disrupt our operations and
>> impact our services. To mitigate the impact of such outages and ensure
>> continuity, we are exploring the possibility of leveraging Cassandra as a
>> backup solution.
>>
>> Our proposed approach involves storing messages in Cassandra during Kafka
>> outages. Subsequently, we plan to implement a scheduler that will read from
>> Cassandra and attempt to write these messages back into Kafka once it is
>> operational again.
>>
>> We believe that by adopting this strategy, we can achieve the following
>> benefits:
>>
>>1.
>>
>>Improved Fault Tolerance: By having a backup mechanism in place, we
>>can reduce the risk of data loss and ensure continuity of operations 
>> during
>>Kafka outages.
>>2.
>>
>>Enhanced Reliability: Cassandra's distributed architecture and
>>built-in replication features make it well-suited for storing data
>>reliably, even in the face of failures.
>>3.
>>
>>Scalability: Both Cassandra and Kafka are designed to scale
>>horizontally, allowing us to handle increased loads seamlessly.
>>
>> Before proceeding further with this approach, we would greatly appreciate
>> any feedback, suggestions, or concerns from the community. Specifically, we
>> are interested in hearing about:
>>
>>- Potential challenges or drawbacks of using Cassandra as a backup
>>solution for Kafka outages.
>>- Best practices or recommendations for implementing such a backup
>>mechanism effectively.
>>- Any alternative approaches or technologies that we should consider?
>>
>> Your expertise and insights are invaluable to us, and we are eager to
>> learn from your experiences and perspectives. Please feel free to share
>> your thoughts or reach out to us with any questions or clarifications.
>>
>> Thank you for taking the time to consider our proposal, and we look
>> forward to hearing from you soon.
>> Thanks and regards,
>> Gowtham S
>>
>>

Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-19 Thread Bowen Song via user

You can have a read at 
https://www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets


Your table schema does not include the most important piece of 
information - the partition keys (and clustering keys, if any). Keep in 
mind that you can only efficiently query Cassandra by the exact 
partition key or the token of a partition key, otherwise you will have 
to rely on MV or secondary index, or worse, scan the entire table (all 
the nodes) to find your data.


A Cassandra schema should look like this:

CREATE TABLE xyz (
  a text,
  b text,
  c timeuuid,
  d int,
  e text,
  PRIMARY KEY ((a, b), c, d)
);

The line "PRIMARY KEY" contains arguably the most important piece of 
information of the table schema.



On 19/02/2024 06:52, Gowtham S wrote:

Hi Bowen

which is a well documented anti-pattern.

Can you please explain more on this, I'm not aware of it. It will be 
helpful to make decisions.

Please find the below table schema

*Table schema*
TopicName - text
Partition - int
MessageUUID - text
Actual data - text
OccurredTime - Timestamp
Status - boolean

We are planning to read the table with the topic name and the status 
is not true. And produce those to the respective topic when Kafka is live.


Thanks and regards,
Gowtham S


On Sat, 17 Feb 2024 at 18:10, Bowen Song via user 
 wrote:


Hi Gowtham,

On the face of it, it sounds like you are planning to use
Cassandra for a queue-like application, which is a well documented
anti-pattern. If that's not the case, can you please show the
table schema and some example queries?

Cheers,
Bowen

On 17/02/2024 08:44, Gowtham S wrote:


Dear Cassandra Community,

I am reaching out to seek your valuable feedback and insights on
a proposed solution we are considering for managing Kafka outages
using Cassandra.

At our organization, we heavily rely on Kafka for real-time data
processing and messaging. However, like any technology, Kafka is
susceptible to occasional outages which can disrupt our
operations and impact our services. To mitigate the impact of
such outages and ensure continuity, we are exploring the
possibility of leveraging Cassandra as a backup solution.

Our proposed approach involves storing messages in Cassandra
during Kafka outages. Subsequently, we plan to implement a
scheduler that will read from Cassandra and attempt to write
these messages back into Kafka once it is operational again.

We believe that by adopting this strategy, we can achieve the
following benefits:

1.

Improved Fault Tolerance: By having a backup mechanism in
place, we can reduce the risk of data loss and ensure
continuity of operations during Kafka outages.

2.

Enhanced Reliability: Cassandra's distributed architecture
and built-in replication features make it well-suited for
storing data reliably, even in the face of failures.

3.

Scalability: Both Cassandra and Kafka are designed to scale
horizontally, allowing us to handle increased loads seamlessly.

Before proceeding further with this approach, we would greatly
appreciate any feedback, suggestions, or concerns from the
community. Specifically, we are interested in hearing about:

  * Potential challenges or drawbacks of using Cassandra as a
backup solution for Kafka outages.
  * Best practices or recommendations for implementing such a
backup mechanism effectively.
  * Any alternative approaches or technologies that we should
consider?

Your expertise and insights are invaluable to us, and we are
eager to learn from your experiences and perspectives. Please
feel free to share your thoughts or reach out to us with any
questions or clarifications.

Thank you for taking the time to consider our proposal, and we
look forward to hearing from you soon.

Thanks and regards,
Gowtham S

Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-18 Thread Gowtham S

Hi Bowen

which is a well documented anti-pattern.
>
Can you please explain more on this, I'm not aware of it. It will be
helpful to make decisions.
Please find the below table schema

*Table schema*
TopicName - text
Partition - int
MessageUUID - text
Actual data - text
OccurredTime - Timestamp
Status - boolean

We are planning to read the table with the topic name and the status is
not true. And produce those to the respective topic when Kafka is live.

Thanks and regards,
Gowtham S


On Sat, 17 Feb 2024 at 18:10, Bowen Song via user 
wrote:

> Hi Gowtham,
>
> On the face of it, it sounds like you are planning to use Cassandra for a
> queue-like application, which is a well documented anti-pattern. If that's
> not the case, can you please show the table schema and some example queries?
>
> Cheers,
> Bowen
> On 17/02/2024 08:44, Gowtham S wrote:
>
> Dear Cassandra Community,
>
> I am reaching out to seek your valuable feedback and insights on a
> proposed solution we are considering for managing Kafka outages using
> Cassandra.
>
> At our organization, we heavily rely on Kafka for real-time data
> processing and messaging. However, like any technology, Kafka is
> susceptible to occasional outages which can disrupt our operations and
> impact our services. To mitigate the impact of such outages and ensure
> continuity, we are exploring the possibility of leveraging Cassandra as a
> backup solution.
>
> Our proposed approach involves storing messages in Cassandra during Kafka
> outages. Subsequently, we plan to implement a scheduler that will read from
> Cassandra and attempt to write these messages back into Kafka once it is
> operational again.
>
> We believe that by adopting this strategy, we can achieve the following
> benefits:
>
>1.
>
>Improved Fault Tolerance: By having a backup mechanism in place, we
>can reduce the risk of data loss and ensure continuity of operations during
>Kafka outages.
>2.
>
>Enhanced Reliability: Cassandra's distributed architecture and
>built-in replication features make it well-suited for storing data
>reliably, even in the face of failures.
>3.
>
>Scalability: Both Cassandra and Kafka are designed to scale
>horizontally, allowing us to handle increased loads seamlessly.
>
> Before proceeding further with this approach, we would greatly appreciate
> any feedback, suggestions, or concerns from the community. Specifically, we
> are interested in hearing about:
>
>- Potential challenges or drawbacks of using Cassandra as a backup
>solution for Kafka outages.
>- Best practices or recommendations for implementing such a backup
>mechanism effectively.
>- Any alternative approaches or technologies that we should consider?
>
> Your expertise and insights are invaluable to us, and we are eager to
> learn from your experiences and perspectives. Please feel free to share
> your thoughts or reach out to us with any questions or clarifications.
>
> Thank you for taking the time to consider our proposal, and we look
> forward to hearing from you soon.
> Thanks and regards,
> Gowtham S
>
>

Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-17 Thread Slater, Ben via user



TBH, this sounds to me like a very expensive (in terms of effort) way to deal 
with whatever Kafka unreliability you’re having. We have lots of both Kafka and 
Cassandra clusters under management and I have no doubt that Kafka is capable 
of being as reliable as Cassandra (and both are capable of achieving 99.99%+ 
availability) and, if anything, is easier to achieve that reliability with 
Kafka. Adding an additional distributed tech to manage is a whole lot of new 
learning if you’re not already expert at it.

I think someone else suggest just running parallel Kafka cluster – I’ve 
certainly seen that be succesful. However, a really good recommendation 
probably requires a bit more understand of just what kind of issues you’re 
worried about with Kafka.

Cheers
Ben




From: Bowen Song via user 
Date: Saturday, 17 February 2024 at 23:40
To: user@cassandra.apache.org 
Cc: Bowen Song 
Subject: Re: Requesting Feedback for Cassandra as a backup solution.
EXTERNAL EMAIL - USE CAUTION when clicking links or attachments



Hi Gowtham,

On the face of it, it sounds like you are planning to use Cassandra for a 
queue-like application, which is a well documented anti-pattern. If that's not 
the case, can you please show the table schema and some example queries?

Cheers,
Bowen
On 17/02/2024 08:44, Gowtham S wrote:

Dear Cassandra Community,

I am reaching out to seek your valuable feedback and insights on a proposed 
solution we are considering for managing Kafka outages using Cassandra.

At our organization, we heavily rely on Kafka for real-time data processing and 
messaging. However, like any technology, Kafka is susceptible to occasional 
outages which can disrupt our operations and impact our services. To mitigate 
the impact of such outages and ensure continuity, we are exploring the 
possibility of leveraging Cassandra as a backup solution.

Our proposed approach involves storing messages in Cassandra during Kafka 
outages. Subsequently, we plan to implement a scheduler that will read from 
Cassandra and attempt to write these messages back into Kafka once it is 
operational again.

We believe that by adopting this strategy, we can achieve the following 
benefits:

  1.  Improved Fault Tolerance: By having a backup mechanism in place, we can 
reduce the risk of data loss and ensure continuity of operations during Kafka 
outages.
  2.  Enhanced Reliability: Cassandra's distributed architecture and built-in 
replication features make it well-suited for storing data reliably, even in the 
face of failures.
  3.  Scalability: Both Cassandra and Kafka are designed to scale horizontally, 
allowing us to handle increased loads seamlessly.

Before proceeding further with this approach, we would greatly appreciate any 
feedback, suggestions, or concerns from the community. Specifically, we are 
interested in hearing about:

  *   Potential challenges or drawbacks of using Cassandra as a backup solution 
for Kafka outages.
  *   Best practices or recommendations for implementing such a backup 
mechanism effectively.
  *   Any alternative approaches or technologies that we should consider?

Your expertise and insights are invaluable to us, and we are eager to learn 
from your experiences and perspectives. Please feel free to share your thoughts 
or reach out to us with any questions or clarifications.

Thank you for taking the time to consider our proposal, and we look forward to 
hearing from you soon.
Thanks and regards,
Gowtham S

Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-17 Thread Bowen Song via user


Hi Gowtham,

On the face of it, it sounds like you are planning to use Cassandra for 
a queue-like application, which is a well documented anti-pattern. If 
that's not the case, can you please show the table schema and some 
example queries?


Cheers,
Bowen

On 17/02/2024 08:44, Gowtham S wrote:


Dear Cassandra Community,

I am reaching out to seek your valuable feedback and insights on a 
proposed solution we are considering for managing Kafka outages using 
Cassandra.


At our organization, we heavily rely on Kafka for real-time data 
processing and messaging. However, like any technology, Kafka is 
susceptible to occasional outages which can disrupt our operations and 
impact our services. To mitigate the impact of such outages and ensure 
continuity, we are exploring the possibility of leveraging Cassandra 
as a backup solution.


Our proposed approach involves storing messages in Cassandra during 
Kafka outages. Subsequently, we plan to implement a scheduler that 
will read from Cassandra and attempt to write these messages back into 
Kafka once it is operational again.


We believe that by adopting this strategy, we can achieve the 
following benefits:


1.

Improved Fault Tolerance: By having a backup mechanism in place,
we can reduce the risk of data loss and ensure continuity of
operations during Kafka outages.

2.

Enhanced Reliability: Cassandra's distributed architecture and
built-in replication features make it well-suited for storing data
reliably, even in the face of failures.

3.

Scalability: Both Cassandra and Kafka are designed to scale
horizontally, allowing us to handle increased loads seamlessly.

Before proceeding further with this approach, we would greatly 
appreciate any feedback, suggestions, or concerns from the community. 
Specifically, we are interested in hearing about:


  * Potential challenges or drawbacks of using Cassandra as a backup
solution for Kafka outages.
  * Best practices or recommendations for implementing such a backup
mechanism effectively.
  * Any alternative approaches or technologies that we should consider?

Your expertise and insights are invaluable to us, and we are eager to 
learn from your experiences and perspectives. Please feel free to 
share your thoughts or reach out to us with any questions or 
clarifications.


Thank you for taking the time to consider our proposal, and we look 
forward to hearing from you soon.


Thanks and regards,
Gowtham S

Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-17 Thread Gowtham S

Thanks for your suggestion
Thanks and regards,
Gowtham S


On Sat, 17 Feb 2024 at 14:58, CPC  wrote:

> hi,
>
> We implemented same strategy in one of our customers. Since 2016 we had
> one downtime in one DC because of high temperature(whole physical DC
> shutdown).
>
> With that approach I assume you will use Cassandra as a queue. You have to
> be careful about modeling and should use multiple partitions may be based
> on hour or fixed size partitions.
>
> Another thing is that Kafka has really high throughput so you should plan
> how many Cassandra node you need to meet same throughput.
>
> Another approach would be to use another Kafka cluster or queue technology
> as backup.
>
>
>
> On Sat, Feb 17, 2024, 11:45 AM Gowtham S  wrote:
>
>> Dear Cassandra Community,
>>
>> I am reaching out to seek your valuable feedback and insights on a
>> proposed solution we are considering for managing Kafka outages using
>> Cassandra.
>>
>> At our organization, we heavily rely on Kafka for real-time data
>> processing and messaging. However, like any technology, Kafka is
>> susceptible to occasional outages which can disrupt our operations and
>> impact our services. To mitigate the impact of such outages and ensure
>> continuity, we are exploring the possibility of leveraging Cassandra as a
>> backup solution.
>>
>> Our proposed approach involves storing messages in Cassandra during Kafka
>> outages. Subsequently, we plan to implement a scheduler that will read from
>> Cassandra and attempt to write these messages back into Kafka once it is
>> operational again.
>>
>> We believe that by adopting this strategy, we can achieve the following
>> benefits:
>>
>>1.
>>
>>Improved Fault Tolerance: By having a backup mechanism in place, we
>>can reduce the risk of data loss and ensure continuity of operations 
>> during
>>Kafka outages.
>>2.
>>
>>Enhanced Reliability: Cassandra's distributed architecture and
>>built-in replication features make it well-suited for storing data
>>reliably, even in the face of failures.
>>3.
>>
>>Scalability: Both Cassandra and Kafka are designed to scale
>>horizontally, allowing us to handle increased loads seamlessly.
>>
>> Before proceeding further with this approach, we would greatly appreciate
>> any feedback, suggestions, or concerns from the community. Specifically, we
>> are interested in hearing about:
>>
>>- Potential challenges or drawbacks of using Cassandra as a backup
>>solution for Kafka outages.
>>- Best practices or recommendations for implementing such a backup
>>mechanism effectively.
>>- Any alternative approaches or technologies that we should consider?
>>
>> Your expertise and insights are invaluable to us, and we are eager to
>> learn from your experiences and perspectives. Please feel free to share
>> your thoughts or reach out to us with any questions or clarifications.
>>
>> Thank you for taking the time to consider our proposal, and we look
>> forward to hearing from you soon.
>> Thanks and regards,
>> Gowtham S
>>
>

Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-17 Thread CPC

hi,

We implemented same strategy in one of our customers. Since 2016 we had one
downtime in one DC because of high temperature(whole physical DC shutdown).

With that approach I assume you will use Cassandra as a queue. You have to
be careful about modeling and should use multiple partitions may be based
on hour or fixed size partitions.

Another thing is that Kafka has really high throughput so you should plan
how many Cassandra node you need to meet same throughput.

Another approach would be to use another Kafka cluster or queue technology
as backup.



On Sat, Feb 17, 2024, 11:45 AM Gowtham S  wrote:

> Dear Cassandra Community,
>
> I am reaching out to seek your valuable feedback and insights on a
> proposed solution we are considering for managing Kafka outages using
> Cassandra.
>
> At our organization, we heavily rely on Kafka for real-time data
> processing and messaging. However, like any technology, Kafka is
> susceptible to occasional outages which can disrupt our operations and
> impact our services. To mitigate the impact of such outages and ensure
> continuity, we are exploring the possibility of leveraging Cassandra as a
> backup solution.
>
> Our proposed approach involves storing messages in Cassandra during Kafka
> outages. Subsequently, we plan to implement a scheduler that will read from
> Cassandra and attempt to write these messages back into Kafka once it is
> operational again.
>
> We believe that by adopting this strategy, we can achieve the following
> benefits:
>
>1.
>
>Improved Fault Tolerance: By having a backup mechanism in place, we
>can reduce the risk of data loss and ensure continuity of operations during
>Kafka outages.
>2.
>
>Enhanced Reliability: Cassandra's distributed architecture and
>built-in replication features make it well-suited for storing data
>reliably, even in the face of failures.
>3.
>
>Scalability: Both Cassandra and Kafka are designed to scale
>horizontally, allowing us to handle increased loads seamlessly.
>
> Before proceeding further with this approach, we would greatly appreciate
> any feedback, suggestions, or concerns from the community. Specifically, we
> are interested in hearing about:
>
>- Potential challenges or drawbacks of using Cassandra as a backup
>solution for Kafka outages.
>- Best practices or recommendations for implementing such a backup
>mechanism effectively.
>- Any alternative approaches or technologies that we should consider?
>
> Your expertise and insights are invaluable to us, and we are eager to
> learn from your experiences and perspectives. Please feel free to share
> your thoughts or reach out to us with any questions or clarifications.
>
> Thank you for taking the time to consider our proposal, and we look
> forward to hearing from you soon.
> Thanks and regards,
> Gowtham S
>

Re: Switching to Incremental Repair

2024-02-15 Thread Chris Lohfink

I would recommend adding something to C* to be able to flip the repaired
state on all sstables quickly (with default OSS can turn nodes off one at a
time and use sstablerepairedset). It's a life saver to be able to revert
back to non-IR if migration going south. Same can be used to quickly switch
into IR sstables with more caveats. Probably worth a jira to add a faster
solution

On Thu, Feb 15, 2024 at 12:50 PM Kristijonas Zalys  wrote:

> Hi folks,
>
> One last question regarding incremental repair.
>
> What would be a safe approach to temporarily stop running incremental
> repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
> understanding is that if we simply stop running incremental repair, the
> cluster's nodes can, in the worst case, double in disk size as the repaired
> dataset will not get compacted with the unrepaired dataset. Similar to
> Sebastian, we have nodes where the disk usage is multiple TiBs so
> significant growth can be quite dangerous in our case. Would the only safe
> choice be to mark all SSTables as unrepaired before stopping regular
> incremental repair?
>
> Thanks,
> Kristijonas
>
>
> On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <
> user@cassandra.apache.org> wrote:
>
>> The over-streaming is only problematic for the repaired SSTables, but it
>> can be triggered by inconsistencies within the unrepaired SSTables
>> during an incremental repair session. This is because although an
>> incremental repair will only compare the unrepaired SSTables, but it
>> will stream both the unrepaired and repaired SSTables for the
>> inconsistent token ranges. Keep in mind that the source SSTables for
>> streaming is selected based on the token ranges, not the
>> repaired/unrepaired state.
>>
>> Base on the above, I'm unsure running an incremental repair before a
>> full repair can fully avoid the over-streaming issue.
>>
>> On 07/02/2024 22:41, Sebastian Marsching wrote:
>> > Thank you very much for your explanation.
>> >
>> > Streaming happens on the token range level, not the SSTable level,
>> right? So, when running an incremental repair before the full repair, the
>> problem that “some unrepaired SSTables are being marked as repaired on one
>> node but not on another” should not exist any longer. Now this data should
>> be marked as repaired on all nodes.
>> >
>> > Thus, when repairing the SSTables that are marked as repaired, this
>> data should be included on all nodes when calculating the Merkle trees and
>> no overstreaming should happen.
>> >
>> > Of course, this means that running an incremental repair *first* after
>> marking SSTables as repaired and only running the full repair *after* that
>> is critical. I have to admit that previously I wasn’t fully aware of how
>> critical this step is.
>> >
>> >> Am 07.02.2024 um 20:22 schrieb Bowen Song via user <
>> user@cassandra.apache.org>:
>> >>
>> >> Unfortunately repair doesn't compare each partition individually.
>> Instead, it groups multiple partitions together and calculate a hash of
>> them, stores the hash in a leaf of a merkle tree, and then compares the
>> merkle trees between replicas during a repair session. If any one of the
>> partitions covered by a leaf is inconsistent between replicas, the hash
>> values in these leaves will be different, and all partitions covered by the
>> same leaf will need to be streamed in full.
>> >>
>> >> Knowing that, and also know that your approach can create a lots of
>> inconsistencies in the repaired SSTables because some unrepaired SSTables
>> are being marked as repaired on one node but not on another, you would then
>> understand why over-streaming can happen. The over-streaming is only
>> problematic for the repaired SSTables, because they are much bigger than
>> the unrepaired.
>> >>
>> >>
>> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>  Caution, using the method you described, the amount of data streamed
>> at the end with the full repair is not the amount of data written between
>> stopping the first node and the last node, but depends on the table size,
>> the number of partitions written, their distribution in the ring and the
>> 'repair_session_space' value. If the table is large, the writes touch a
>> large number of partitions scattered across the token ring, and the value
>> of 'repair_session_space' is small, you may end up with a very expensive
>> over-streaming.
>> >>> Thanks for the warning. In our case it worked well (obviously we
>> tested it on a test cluster before applying it on the production clusters),
>> but it is good to know that this might not always be the case.
>> >>>
>> >>> Maybe I misunderstand how full and incremental repairs work in C*
>> 4.x. I would appreciate if you could clarify this for me.
>> >>>
>> >>> So far, I assumed that a full repair on a cluster that is also using
>> incremental repair pretty much works like on a cluster that is not using
>> incremental repair at all, the only difference being that the set

Re: Switching to Incremental Repair

2024-02-15 Thread Bowen Song via user

The gc_grace_seconds, which default to 10 days, is the maximal safe 
interval between repairs. How much data gets written during that period 
of time? Will your nodes run out of disk space because of the new data 
written during that time? If so, it sounds like your nodes are 
dangerously close to running out of disk space, and you should address 
that issue first before even considering upgrading Cassandra.

On 15/02/2024 18:49, Kristijonas Zalys wrote:

Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental 
repair on a cluster (e.g.: during a Cassandra major version upgrade)? 
My understanding is that if we simply stop running incremental repair, 
the cluster's nodes can, in the worst case, double in disk size as the 
repaired dataset will not get compacted with the unrepaired dataset. 
Similar to Sebastian, we have nodes where the disk usage is multiple 
TiBs so significant growth can be quite dangerous in our case. Would 
the only safe choice be to mark all SSTables as unrepaired before 
stopping regular incremental repair?

Thanks,
Kristijonas

On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user 
 wrote:

The over-streaming is only problematic for the repaired SSTables,
but it
can be triggered by inconsistencies within the unrepaired SSTables
during an incremental repair session. This is because although an
incremental repair will only compare the unrepaired SSTables, but it
will stream both the unrepaired and repaired SSTables for the
inconsistent token ranges. Keep in mind that the source SSTables for
streaming is selected based on the token ranges, not the
repaired/unrepaired state.

Base on the above, I'm unsure running an incremental repair before a
full repair can fully avoid the over-streaming issue.

On 07/02/2024 22:41, Sebastian Marsching wrote:
> Thank you very much for your explanation.
>
> Streaming happens on the token range level, not the SSTable
level, right? So, when running an incremental repair before the
full repair, the problem that “some unrepaired SSTables are being
marked as repaired on one node but not on another” should not
exist any longer. Now this data should be marked as repaired on
all nodes.
>
> Thus, when repairing the SSTables that are marked as repaired,
this data should be included on all nodes when calculating the
Merkle trees and no overstreaming should happen.
>
> Of course, this means that running an incremental repair *first*
after marking SSTables as repaired and only running the full
repair *after* that is critical. I have to admit that previously I
wasn’t fully aware of how critical this step is.
>
>> Am 07.02.2024 um 20:22 schrieb Bowen Song via user
:
>>
>> Unfortunately repair doesn't compare each partition
individually. Instead, it groups multiple partitions together and
calculate a hash of them, stores the hash in a leaf of a merkle
tree, and then compares the merkle trees between replicas during a
repair session. If any one of the partitions covered by a leaf is
inconsistent between replicas, the hash values in these leaves
will be different, and all partitions covered by the same leaf
will need to be streamed in full.
>>
>> Knowing that, and also know that your approach can create a
lots of inconsistencies in the repaired SSTables because some
unrepaired SSTables are being marked as repaired on one node but
not on another, you would then understand why over-streaming can
happen. The over-streaming is only problematic for the repaired
SSTables, because they are much bigger than the unrepaired.
>>
>>
>> On 07/02/2024 17:00, Sebastian Marsching wrote:
 Caution, using the method you described, the amount of data
streamed at the end with the full repair is not the amount of data
written between stopping the first node and the last node, but
depends on the table size, the number of partitions written, their
distribution in the ring and the 'repair_session_space' value. If
the table is large, the writes touch a large number of partitions
scattered across the token ring, and the value of
'repair_session_space' is small, you may end up with a very
expensive over-streaming.
>>> Thanks for the warning. In our case it worked well (obviously
we tested it on a test cluster before applying it on the
production clusters), but it is good to know that this might not
always be the case.
>>>
>>> Maybe I misunderstand how full and incremental repairs work in
C* 4.x. I would appreciate if you could clarify this for me.
>>>
>>> So far, I assumed that a full repair on a cluster that is also
using incremental repair pretty much works like on a cluster that
is not using incremental repair at all, the only difference

Re: Switching to Incremental Repair

2024-02-15 Thread Kristijonas Zalys

Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental
repair on a cluster (e.g.: during a Cassandra major version upgrade)? My
understanding is that if we simply stop running incremental repair, the
cluster's nodes can, in the worst case, double in disk size as the repaired
dataset will not get compacted with the unrepaired dataset. Similar to
Sebastian, we have nodes where the disk usage is multiple TiBs so
significant growth can be quite dangerous in our case. Would the only safe
choice be to mark all SSTables as unrepaired before stopping regular
incremental repair?

Thanks,
Kristijonas


On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user <
user@cassandra.apache.org> wrote:

> The over-streaming is only problematic for the repaired SSTables, but it
> can be triggered by inconsistencies within the unrepaired SSTables
> during an incremental repair session. This is because although an
> incremental repair will only compare the unrepaired SSTables, but it
> will stream both the unrepaired and repaired SSTables for the
> inconsistent token ranges. Keep in mind that the source SSTables for
> streaming is selected based on the token ranges, not the
> repaired/unrepaired state.
>
> Base on the above, I'm unsure running an incremental repair before a
> full repair can fully avoid the over-streaming issue.
>
> On 07/02/2024 22:41, Sebastian Marsching wrote:
> > Thank you very much for your explanation.
> >
> > Streaming happens on the token range level, not the SSTable level,
> right? So, when running an incremental repair before the full repair, the
> problem that “some unrepaired SSTables are being marked as repaired on one
> node but not on another” should not exist any longer. Now this data should
> be marked as repaired on all nodes.
> >
> > Thus, when repairing the SSTables that are marked as repaired, this data
> should be included on all nodes when calculating the Merkle trees and no
> overstreaming should happen.
> >
> > Of course, this means that running an incremental repair *first* after
> marking SSTables as repaired and only running the full repair *after* that
> is critical. I have to admit that previously I wasn’t fully aware of how
> critical this step is.
> >
> >> Am 07.02.2024 um 20:22 schrieb Bowen Song via user <
> user@cassandra.apache.org>:
> >>
> >> Unfortunately repair doesn't compare each partition individually.
> Instead, it groups multiple partitions together and calculate a hash of
> them, stores the hash in a leaf of a merkle tree, and then compares the
> merkle trees between replicas during a repair session. If any one of the
> partitions covered by a leaf is inconsistent between replicas, the hash
> values in these leaves will be different, and all partitions covered by the
> same leaf will need to be streamed in full.
> >>
> >> Knowing that, and also know that your approach can create a lots of
> inconsistencies in the repaired SSTables because some unrepaired SSTables
> are being marked as repaired on one node but not on another, you would then
> understand why over-streaming can happen. The over-streaming is only
> problematic for the repaired SSTables, because they are much bigger than
> the unrepaired.
> >>
> >>
> >> On 07/02/2024 17:00, Sebastian Marsching wrote:
>  Caution, using the method you described, the amount of data streamed
> at the end with the full repair is not the amount of data written between
> stopping the first node and the last node, but depends on the table size,
> the number of partitions written, their distribution in the ring and the
> 'repair_session_space' value. If the table is large, the writes touch a
> large number of partitions scattered across the token ring, and the value
> of 'repair_session_space' is small, you may end up with a very expensive
> over-streaming.
> >>> Thanks for the warning. In our case it worked well (obviously we
> tested it on a test cluster before applying it on the production clusters),
> but it is good to know that this might not always be the case.
> >>>
> >>> Maybe I misunderstand how full and incremental repairs work in C* 4.x.
> I would appreciate if you could clarify this for me.
> >>>
> >>> So far, I assumed that a full repair on a cluster that is also using
> incremental repair pretty much works like on a cluster that is not using
> incremental repair at all, the only difference being that the set of
> repaired und unrepaired data is repaired separately, so the Merkle trees
> that are calculated for repaired and unrepaired data are completely
> separate.
> >>>
> >>> I also assumed that incremental repair only looks at unrepaired data,
> which is why it is so fast.
> >>>
> >>> Is either of these two assumptions wrong?
> >>>
> >>> If not, I do not quite understand how a lot of overstreaming might
> happen, as long as (I forgot to mention this step in my original e-mail) I
> run an incremental repair directly after restarting the nodes

RE: SStables stored in directory with different table ID than the one found in system_schema.tables

2024-02-12 Thread Michalis Kotsiouros (EXT) via user

Hello Sebastian and community,

Thanks a lot for the post. It is really helpful.

After some additional observations, I am more concerned about trying to 
rename/move the sstables directory. I have observed that my client processes 
complain about missing columns even though those columns appear on the describe 
schema output.

My plan is to first try a restart of the Cassandra nodes and if that does not 
help to re-build the datacenter – remove it and then add it back to the cluster.

 

BR

MK

 

From: Sebastian Marsching  
Sent: February 10, 2024 01:00
To: Bowen Song via user 
Cc: Michalis Kotsiouros (EXT) 
Subject: Re: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

You might the following discussion from the mailing-list archive helpful:

 

https://lists.apache.org/thread/6hnypp6vfxj1yc35ptp0xf15f11cx77d

 

This thread discusses a similar situation gives a few pointers on when it might 
be save to simply move the SSTables around.





Am 08.02.2024 um 13:06 schrieb Michalis Kotsiouros (EXT) via user 
mailto:user@cassandra.apache.org> >:

 

Hello everyone,

I have found this post on-line and seems to be recent.

 
<https://stackoverflow.com/questions/77837100/mismatch-between-cassandra-table-uuid-in-linux-file-directory-and-system-schema>
 Mismatch between Cassandra table uuid in linux file directory and 
system_schema.tables - Stack Overflow

The description seems to be the same as my problem as well.

In this post, the proposal is to copy the sstables to the dir with the ID found 
in system_schema.tables. I think it is equivalent with my assumption to rename 
the directories….

Have anyone seen this before? Do you consider those approaches safe?

 

BR

MK

 

From: Michalis Kotsiouros (EXT) 
Sent: February 08, 2024 11:33
To: user@cassandra.apache.org <mailto:user@cassandra.apache.org> 
Subject: SStables stored in directory with different table ID than the one 
found in system_schema.tables

 

Hello community,

I have a Cassandra server on 3.11.13 on SLESS 12.5.

I have noticed in the logs the following line:

Datacenter A

org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId d8c1bea0-82ed-11ee-8ac8-1513e17b60b1. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.

Datacenter B

org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId 0fedabd0-11f7-11ea-9450-e3ff59b2496b. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.

 

This error results in failure of all streaming tasks.

I have checked the sstables directories and I see that:

 

In Datacenter A the sstables directory is:

-0fedabd0-11f7-11ea-9450-e3ff59b2496b

 

In Datacenter B the sstables directory are:

-0fedabd011f711ea9450e3ff59b2496b

- d8c1bea082ed11ee8ac81513e17b60b1

In this datacenter although the - d8c1bea082ed11ee8ac81513e17b60b1 
dir is more recent it is empty and all sstables are stored under 
-0fedabd011f711ea9450e3ff59b2496b

 

I have also checked the system_schema.tables in all Cassandra nodes and I see 
that for the specific table the ID is consistent across all nodes and it is:

d8c1bea0-82ed-11ee-8ac8-1513e17b60b1

 

So it seems that the schema is a bit mess in all my datacenters. I am not 
really interested to understand how it ended up in this status but more on how 
to recover.

Both datacenters seem to have this inconsistency between the id stored 
system_schema.tables and the one used in the sstables directory.

Do you have any proposal on how to recover?

I have thought of renaming the dir from 
-0fedabd011f711ea9450e3ff59b2496b to - 
d8c1bea082ed11ee8ac81513e17b60b1 but it does not look safe and I would not want 
to risk my data since this is a production system.

 

Thank you in advance.

 

BR

Michail Kotsiouros

 



smime.p7s
Description: S/MIME cryptographic signature

Re: SStables stored in directory with different table ID than the one found in system_schema.tables

2024-02-09 Thread Sebastian Marsching

You might the following discussion from the mailing-list archive helpful:

https://lists.apache.org/thread/6hnypp6vfxj1yc35ptp0xf15f11cx77d

This thread discusses a similar situation gives a few pointers on when it might 
be save to simply move the SSTables around.

> Am 08.02.2024 um 13:06 schrieb Michalis Kotsiouros (EXT) via user 
> :
> 
> Hello everyone,
> I have found this post on-line and seems to be recent.
> Mismatch between Cassandra table uuid in linux file directory and 
> system_schema.tables - Stack Overflow 
> 
> The description seems to be the same as my problem as well.
> In this post, the proposal is to copy the sstables to the dir with the ID 
> found in system_schema.tables. I think it is equivalent with my assumption to 
> rename the directories….
> Have anyone seen this before? Do you consider those approaches safe?
>  
> BR
> MK
>  
> From: Michalis Kotsiouros (EXT) 
> Sent: February 08, 2024 11:33
> To: user@cassandra.apache.org
> Subject: SStables stored in directory with different table ID than the one 
> found in system_schema.tables
>  
> Hello community,
> I have a Cassandra server on 3.11.13 on SLESS 12.5.
> I have noticed in the logs the following line:
> Datacenter A
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
> cfId d8c1bea0-82ed-11ee-8ac8-1513e17b60b1. If a table was just created, this 
> is likely due to the schema not being fully propagated.  Please wait for 
> schema agreement on table creation.
> Datacenter B
> org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
> cfId 0fedabd0-11f7-11ea-9450-e3ff59b2496b. If a table was just created, this 
> is likely due to the schema not being fully propagated.  Please wait for 
> schema agreement on table creation.
>  
> This error results in failure of all streaming tasks.
> I have checked the sstables directories and I see that:
>  
> In Datacenter A the sstables directory is:
> -0fedabd0-11f7-11ea-9450-e3ff59b2496b
>  
> In Datacenter B the sstables directory are:
> -0fedabd011f711ea9450e3ff59b2496b
> - d8c1bea082ed11ee8ac81513e17b60b1
> In this datacenter although the - 
> d8c1bea082ed11ee8ac81513e17b60b1 dir is more recent it is empty and all 
> sstables are stored under -0fedabd011f711ea9450e3ff59b2496b
>  
> I have also checked the system_schema.tables in all Cassandra nodes and I see 
> that for the specific table the ID is consistent across all nodes and it is:
> d8c1bea0-82ed-11ee-8ac8-1513e17b60b1
>  
> So it seems that the schema is a bit mess in all my datacenters. I am not 
> really interested to understand how it ended up in this status but more on 
> how to recover.
> Both datacenters seem to have this inconsistency between the id stored 
> system_schema.tables and the one used in the sstables directory.
> Do you have any proposal on how to recover?
> I have thought of renaming the dir from 
> -0fedabd011f711ea9450e3ff59b2496b to - 
> d8c1bea082ed11ee8ac81513e17b60b1 but it does not look safe and I would not 
> want to risk my data since this is a production system.
>  
> Thank you in advance.
>  
> BR
> Michail Kotsiouros



smime.p7s
Description: S/MIME cryptographic signature

Re: Regarding Cassandra 4 Support End time

2024-02-09 Thread Mukhesh Chowdary

Unsubscribe

Regards
V. Mukhesh Chowdary
Architect.
CA/2012/55397


On Fri, 9 Feb 2024 at 12:12, ranju goel  wrote:

> Hi All,
>
> As per the link (https://cassandra.apache.org/_/download.html) Cassandra
> 4.0 is going to be maintained till release of 5.1. (July 2024 tentative).
> Since Cassandra 5 is yet to be released, Can we expect Cassandra 4.0.x
> support to be increased. This information will help us in planning our
> upgrade.
>
> Thanks & Regards
> Ranju
>

RE: SStables stored in directory with different table ID than the one found in system_schema.tables

2024-02-08 Thread Michalis Kotsiouros (EXT) via user

Hello everyone,
I have found this post on-line and seems to be recent.
Mismatch between Cassandra table uuid in linux file directory and 
system_schema.tables - Stack 
Overflow
The description seems to be the same as my problem as well.
In this post, the proposal is to copy the sstables to the dir with the ID found 
in system_schema.tables. I think it is equivalent with my assumption to rename 
the directories
Have anyone seen this before? Do you consider those approaches safe?

BR
MK

From: Michalis Kotsiouros (EXT)
Sent: February 08, 2024 11:33
To: user@cassandra.apache.org
Subject: SStables stored in directory with different table ID than the one 
found in system_schema.tables

Hello community,
I have a Cassandra server on 3.11.13 on SLESS 12.5.
I have noticed in the logs the following line:
Datacenter A
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId d8c1bea0-82ed-11ee-8ac8-1513e17b60b1. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.
Datacenter B
org.apache.cassandra.db.UnknownColumnFamilyException: Couldn't find table for 
cfId 0fedabd0-11f7-11ea-9450-e3ff59b2496b. If a table was just created, this is 
likely due to the schema not being fully propagated.  Please wait for schema 
agreement on table creation.

This error results in failure of all streaming tasks.
I have checked the sstables directories and I see that:

In Datacenter A the sstables directory is:
-0fedabd0-11f7-11ea-9450-e3ff59b2496b

In Datacenter B the sstables directory are:
-0fedabd011f711ea9450e3ff59b2496b
- d8c1bea082ed11ee8ac81513e17b60b1
In this datacenter although the - d8c1bea082ed11ee8ac81513e17b60b1 
dir is more recent it is empty and all sstables are stored under 
-0fedabd011f711ea9450e3ff59b2496b

I have also checked the system_schema.tables in all Cassandra nodes and I see 
that for the specific table the ID is consistent across all nodes and it is:
d8c1bea0-82ed-11ee-8ac8-1513e17b60b1

So it seems that the schema is a bit mess in all my datacenters. I am not 
really interested to understand how it ended up in this status but more on how 
to recover.
Both datacenters seem to have this inconsistency between the id stored 
system_schema.tables and the one used in the sstables directory.
Do you have any proposal on how to recover?
I have thought of renaming the dir from 
-0fedabd011f711ea9450e3ff59b2496b to - 
d8c1bea082ed11ee8ac81513e17b60b1 but it does not look safe and I would not want 
to risk my data since this is a production system.

Thank you in advance.

BR
Michail Kotsiouros

Re: Switching to Incremental Repair

The over-streaming is only problematic for the repaired SSTables, but it 
can be triggered by inconsistencies within the unrepaired SSTables 
during an incremental repair session. This is because although an 
incremental repair will only compare the unrepaired SSTables, but it 
will stream both the unrepaired and repaired SSTables for the 
inconsistent token ranges. Keep in mind that the source SSTables for 
streaming is selected based on the token ranges, not the 
repaired/unrepaired state.


Base on the above, I'm unsure running an incremental repair before a 
full repair can fully avoid the over-streaming issue.


On 07/02/2024 22:41, Sebastian Marsching wrote:

Thank you very much for your explanation.

Streaming happens on the token range level, not the SSTable level, right? So, 
when running an incremental repair before the full repair, the problem that 
“some unrepaired SSTables are being marked as repaired on one node but not on 
another” should not exist any longer. Now this data should be marked as 
repaired on all nodes.

Thus, when repairing the SSTables that are marked as repaired, this data should 
be included on all nodes when calculating the Merkle trees and no overstreaming 
should happen.

Of course, this means that running an incremental repair *first* after marking 
SSTables as repaired and only running the full repair *after* that is critical. 
I have to admit that previously I wasn’t fully aware of how critical this step 
is.


Am 07.02.2024 um 20:22 schrieb Bowen Song via user :

Unfortunately repair doesn't compare each partition individually. Instead, it 
groups multiple partitions together and calculate a hash of them, stores the 
hash in a leaf of a merkle tree, and then compares the merkle trees between 
replicas during a repair session. If any one of the partitions covered by a 
leaf is inconsistent between replicas, the hash values in these leaves will be 
different, and all partitions covered by the same leaf will need to be streamed 
in full.

Knowing that, and also know that your approach can create a lots of 
inconsistencies in the repaired SSTables because some unrepaired SSTables are 
being marked as repaired on one node but not on another, you would then 
understand why over-streaming can happen. The over-streaming is only 
problematic for the repaired SSTables, because they are much bigger than the 
unrepaired.


On 07/02/2024 17:00, Sebastian Marsching wrote:

Caution, using the method you described, the amount of data streamed at the end 
with the full repair is not the amount of data written between stopping the 
first node and the last node, but depends on the table size, the number of 
partitions written, their distribution in the ring and the 
'repair_session_space' value. If the table is large, the writes touch a large 
number of partitions scattered across the token ring, and the value of 
'repair_session_space' is small, you may end up with a very expensive 
over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
und unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

 From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking an data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.

Re: Switching to Incremental Repair


Thank you very much for your explanation.

Streaming happens on the token range level, not the SSTable level, right? So, 
when running an incremental repair before the full repair, the problem that 
“some unrepaired SSTables are being marked as repaired on one node but not on 
another” should not exist any longer. Now this data should be marked as 
repaired on all nodes.

Thus, when repairing the SSTables that are marked as repaired, this data should 
be included on all nodes when calculating the Merkle trees and no overstreaming 
should happen.

Of course, this means that running an incremental repair *first* after marking 
SSTables as repaired and only running the full repair *after* that is critical. 
I have to admit that previously I wasn’t fully aware of how critical this step 
is.

> Am 07.02.2024 um 20:22 schrieb Bowen Song via user 
> :
>
> Unfortunately repair doesn't compare each partition individually. Instead, it 
> groups multiple partitions together and calculate a hash of them, stores the 
> hash in a leaf of a merkle tree, and then compares the merkle trees between 
> replicas during a repair session. If any one of the partitions covered by a 
> leaf is inconsistent between replicas, the hash values in these leaves will 
> be different, and all partitions covered by the same leaf will need to be 
> streamed in full.
>
> Knowing that, and also know that your approach can create a lots of 
> inconsistencies in the repaired SSTables because some unrepaired SSTables are 
> being marked as repaired on one node but not on another, you would then 
> understand why over-streaming can happen. The over-streaming is only 
> problematic for the repaired SSTables, because they are much bigger than the 
> unrepaired.
>
>
> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>> Caution, using the method you described, the amount of data streamed at the 
>>> end with the full repair is not the amount of data written between stopping 
>>> the first node and the last node, but depends on the table size, the number 
>>> of partitions written, their distribution in the ring and the 
>>> 'repair_session_space' value. If the table is large, the writes touch a 
>>> large number of partitions scattered across the token ring, and the value 
>>> of 'repair_session_space' is small, you may end up with a very expensive 
>>> over-streaming.
>> Thanks for the warning. In our case it worked well (obviously we tested it 
>> on a test cluster before applying it on the production clusters), but it is 
>> good to know that this might not always be the case.
>>
>> Maybe I misunderstand how full and incremental repairs work in C* 4.x. I 
>> would appreciate if you could clarify this for me.
>>
>> So far, I assumed that a full repair on a cluster that is also using 
>> incremental repair pretty much works like on a cluster that is not using 
>> incremental repair at all, the only difference being that the set of 
>> repaired und unrepaired data is repaired separately, so the Merkle trees 
>> that are calculated for repaired and unrepaired data are completely separate.
>>
>> I also assumed that incremental repair only looks at unrepaired data, which 
>> is why it is so fast.
>>
>> Is either of these two assumptions wrong?
>>
>> If not, I do not quite understand how a lot of overstreaming might happen, 
>> as long as (I forgot to mention this step in my original e-mail) I run an 
>> incremental repair directly after restarting the nodes and marking all data 
>> as repaired.
>>
>> I understand that significant overstreaming might happen during this first 
>> repair (in the worst case streaming all the unrepaired data that a node 
>> stores), but due to the short amount of time between starting to mark data 
>> as repaired and running the incremental repair, the whole set of unrepaired 
>> data should be rather small, so this overstreaming should not cause any 
>> issues.
>>
>> From this point on, the unrepaired data on the different nodes should be in 
>> sync and discrepancies in the repaired data during the full repair should 
>> not be bigger than they had been if I had run a full repair without marking 
>> an data as repaired.
>>
>> I would really appreciate if you could point out the hole in this reasoning. 
>> Maybe I have a fundamentally wrong understanding of the repair process, and 
>> if I do I would like to correct this.
>>
>



smime.p7s
Description: S/MIME cryptographic signature

Re: Switching to Incremental Repair

Unfortunately repair doesn't compare each partition individually. 
Instead, it groups multiple partitions together and calculate a hash of 
them, stores the hash in a leaf of a merkle tree, and then compares the 
merkle trees between replicas during a repair session. If any one of the 
partitions covered by a leaf is inconsistent between replicas, the hash 
values in these leaves will be different, and all partitions covered by 
the same leaf will need to be streamed in full.


Knowing that, and also know that your approach can create a lots of 
inconsistencies in the repaired SSTables because some unrepaired 
SSTables are being marked as repaired on one node but not on another, 
you would then understand why over-streaming can happen. The 
over-streaming is only problematic for the repaired SSTables, because 
they are much bigger than the unrepaired.



On 07/02/2024 17:00, Sebastian Marsching wrote:

Caution, using the method you described, the amount of data streamed at the end 
with the full repair is not the amount of data written between stopping the 
first node and the last node, but depends on the table size, the number of 
partitions written, their distribution in the ring and the 
'repair_session_space' value. If the table is large, the writes touch a large 
number of partitions scattered across the token ring, and the value of 
'repair_session_space' is small, you may end up with a very expensive 
over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
und unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

 From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking an data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.

Re: Switching to Incremental Repair


> Caution, using the method you described, the amount of data streamed at the 
> end with the full repair is not the amount of data written between stopping 
> the first node and the last node, but depends on the table size, the number 
> of partitions written, their distribution in the ring and the 
> 'repair_session_space' value. If the table is large, the writes touch a large 
> number of partitions scattered across the token ring, and the value of 
> 'repair_session_space' is small, you may end up with a very expensive 
> over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
und unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking an data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.



smime.p7s
Description: S/MIME cryptographic signature

Re: Switching to Incremental Repair

Caution, using the method you described, the amount of data streamed at 
the end with the full repair is not the amount of data written between 
stopping the first node and the last node, but depends on the table 
size, the number of partitions written, their distribution in the ring 
and the 'repair_session_space' value. If the table is large, the writes 
touch a large number of partitions scattered across the token ring, and 
the value of 'repair_session_space' is small, you may end up with a very 
expensive over-streaming.


On 07/02/2024 12:33, Sebastian Marsching wrote:
Full repair running for an entire week sounds excessively long. Even 
if you've got 1 TB of data per node, 1 week means the repair speed is 
less than 2 MB/s, that's very slow. Perhaps you should focus on 
finding the bottleneck of the full repair speed and work on that instead.


We store about 3–3.5 TB per node on spinning disks (time-series data), 
so I don’t think it is too surprising.


Not disabling auto-compaction may result in repaired SSTables getting 
compacted together with unrepaired SSTables before the repair state 
is set on them, which leads to mismatch in the repaired data between 
nodes, and potentially very expensive over-streaming in a future full 
repair. You should follow the documented and tested steps and not 
improvise or getting creative if you value your data and time.


There is a different method that we successfully used on three 
clusters, but I agree that anti-entropy repair is a tricky business 
and one should be cautious with trying less tested methods.


Due to the long time for a full repair (see my earlier explanation), 
disabling anticompaction while running the full repair wasn’t an 
option for us. It was previously suggested that one could run the 
repair per node instead of the full cluster, but I don’t think that 
this will work, because only marking the SSTables on a single node as 
repaired would lead to massive overstreaming when running the full 
repair for the next node that shares data with the first one.


So, I want to describe the method that we used, just in case someone 
is in the same situation:


Going around the ring, we temporarily stopped each node and marked all 
of its SSTables as repaired. Then we immediately ran a full repair, so 
that any inconsistencies in the data that was now marked as repaired 
but not actually repaired were fixed.


Using this approach, the amount over over-streaming is minimal (at 
least for not too large clusters, where the rolling restart can be 
done in less than an hour or so), because the only difference between 
the “unrepaired” SSTables on the different nodes will be the data that 
was written before stopping the first node and stopping the last node.


Any inconsistencies that might exist in the SSTables that were marked 
as repaired should be caught in the full repair, so I do not think it 
is too dangerous either. However, I agree that for clusters where a 
full repair is quick (e.g. finishes in a few hours), using the 
well-tested and frequently used approach is probably better.

Re: Switching to Incremental Repair


> That's a feature we need to implement in Reaper. I think disallowing the 
> start of the new incremental repair would be easier to manage than pausing 
> the full repair that's already running. It's also what I think I'd expect as 
> a user.
>
> I'll create an issue to track this.

Thank you, Alexander, that’s great!

I was considering the other approach (pausing the full repair in order to be 
able to start the incremental repair) because this is what I have been doing 
manually in the past few days. Due to full repairs taking a lot of time for us 
(see my other e-mail), I didn’t want too much unrepaired data too accumulate 
over time.

However, I guess that this is a niche use case, and in most cases inhibiting 
the incremental repair is the correct and expected approach, so I wouldn’t 
expect such a feature in Cassandra Reaper.

For our use case, I am considering abandoning the scheduling feature of Reaper 
and instead writing a simple script that schedules repairs through Reaper’s 
API. This will also give us an easier way of staggering different repair jobs 
instead of having to rely on choosing the start time correctly in order to get 
the desired effect. Doing all this in a custom script is probably much, much 
easier than trying to implement it as a generic, user-configurable feature in 
Reaper.



smime.p7s
Description: S/MIME cryptographic signature

Re: Switching to Incremental Repair

> Full repair running for an entire week sounds excessively long. Even if 
> you've got 1 TB of data per node, 1 week means the repair speed is less than 
> 2 MB/s, that's very slow. Perhaps you should focus on finding the bottleneck 
> of the full repair speed and work on that instead.

We store about 3–3.5 TB per node on spinning disks (time-series data), so I 
don’t think it is too surprising.
> Not disabling auto-compaction may result in repaired SSTables getting 
> compacted together with unrepaired SSTables before the repair state is set on 
> them, which leads to mismatch in the repaired data between nodes, and 
> potentially very expensive over-streaming in a future full repair. You should 
> follow the documented and tested steps and not improvise or getting creative 
> if you value your data and time.
> 
There is a different method that we successfully used on three clusters, but I 
agree that anti-entropy repair is a tricky business and one should be cautious 
with trying less tested methods.

Due to the long time for a full repair (see my earlier explanation), disabling 
anticompaction while running the full repair wasn’t an option for us. It was 
previously suggested that one could run the repair per node instead of the full 
cluster, but I don’t think that this will work, because only marking the 
SSTables on a single node as repaired would lead to massive overstreaming when 
running the full repair for the next node that shares data with the first one.

So, I want to describe the method that we used, just in case someone is in the 
same situation:

Going around the ring, we temporarily stopped each node and marked all of its 
SSTables as repaired. Then we immediately ran a full repair, so that any 
inconsistencies in the data that was now marked as repaired but not actually 
repaired were fixed.

Using this approach, the amount over over-streaming is minimal (at least for 
not too large clusters, where the rolling restart can be done in less than an 
hour or so), because the only difference between the “unrepaired” SSTables on 
the different nodes will be the data that was written before stopping the first 
node and stopping the last node.

Any inconsistencies that might exist in the SSTables that were marked as 
repaired should be caught in the full repair, so I do not think it is too 
dangerous either. However, I agree that for clusters where a full repair is 
quick (e.g. finishes in a few hours), using the well-tested and frequently used 
approach is probably better.

smime.p7s
Description: S/MIME cryptographic signature

Re: Switching to Incremental Repair