Re: compaction trigger after every fix interval

2024-04-28 Thread Bowen Song via user
There are many things that can trigger a compaction; knowing the type of 
compaction can help narrow it down.


Have you looked at the nodetool compactionstats command output when it 
is happening? What is the compaction type? It can be "compaction", but 
can also be something else, such as "validation" or "cleanup".



On 28/04/2024 10:49, Prerna Jain wrote:

Hi team,

I have a query. In our prod environment, there are multiple keyspaces 
and tables. According to requirements, every table has a different 
compaction strategy, like level/time/size.
Somehow, when I checked the compaction history, I noticed that 
compaction occurs every 6 hours for every table.
We did not trigger any job manually, nor did I find any such 
configuration. Also, write traffic does not occur at a fixed 
interval on those tables.

Can you please help me find out the root cause of this case?

I appreciate any help you can provide.

Regards
Prerna Jain

Re: Trouble with using group commitlog_sync

2024-04-24 Thread Bowen Song via user

Okay, that proves I was wrong about the client-side bottleneck.

On 24/04/2024 17:55, Nathan Marz wrote:
I tried running two client processes in parallel and the numbers were 
unchanged. The max throughput is still a single client doing 10 
in-flight BatchStatements, each containing 100 inserts.


On Tue, Apr 23, 2024 at 10:24 PM Bowen Song via user 
 wrote:


You might have run into the bottleneck of the driver's IO thread.
Try increasing the driver's connections-per-server limit to 2 or 3
if you've only got 1 server in the cluster. Alternatively, run
two client processes in parallel.


On 24/04/2024 07:19, Nathan Marz wrote:

Tried it again with one more client thread, and that had no
effect on performance. This is unsurprising as there are only 2 CPUs
on this node and they were already at 100%. These were good
ideas, but I'm still unable to even match the performance of
batch commit mode with group commit mode.

On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user
 wrote:

To achieve 10k loop iterations per second, each iteration
must take 0.1 milliseconds or less. Considering that each
iteration needs to lock and unlock the semaphore (two
syscalls) and make network requests (more syscalls), that's a
lot of context switches. It may be a bit too much to ask of a
single thread. I would suggest trying multi-threading or
multi-processing, and seeing if the combined insert rate is higher.

I should also note that executeAsync() has implicit
limits on the number of in-flight requests, which default to
1024 requests per connection and 1 connection per server. See

https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/


On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple
threads? Using the exact same approach I'm able to get 38k /
second with periodic commitlog_sync. For what it's worth, I
do see 100% CPU utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user
 wrote:

Have you checked the thread CPU utilisation of the
client side? You likely will need more than one thread
to do insertion in a loop to achieve tens of thousands
of inserts per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual
inserts at a time with the same loop + semaphore
approach. This only nets 9k / second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more
individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch
statements should only be used where atomicity or
isolation is needed. Using batch statements won't
make inserting multiple partitions faster. In fact,
it often will make that slower.

Also, the linear relationship between
commitlog_sync_group_window and write throughput is
expected. That's because the max number of
uncompleted writes is limited by the write
concurrency, and a write is not considered
"complete" before it is synced to disk when
commitlog sync is in group or batch mode. That
means within each interval, only a limited number of
writes can be done. The ways to increase that
include: adding more nodes, syncing the commitlog at
shorter intervals, and allowing more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This
causes a single execute of a BatchStatement
containing 100 inserts to succeed. However, the
throughput I'm seeing is atrocious.

With these settings, I'm executing 10
BatchStatements concurrently using the
semaphore + loop approach I showed in my first
message. So as requests complete, more are sent
out such that there are 10 in-flight at a time.
Each BatchStatement has 100 individual inserts.
I'm seeing only 730 inserts / second. Again, with
periodic mode I see 38k / second and with batch I
see 14k / second. My expectation was that group
commit mode throughput would be somewhe

Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user

Hi Paul,

IMO, if they are truly risk-averse, they should follow the tested and 
proven best practices, instead of doing things in a less tested way 
which is also known to pose a danger to data correctness.


If they must do this over a long period of time, then they may need to 
temporarily increase the gc_grace_seconds on all tables, and ensure that 
no DDL or repair is run before the upgrade completes. It is unknown 
whether this route is safe, because it's a less tested route to upgrade 
a cluster.


Please be aware that if they do deletes frequently, increasing the 
gc_grace_seconds may cause some reads to fail due to the elevated number 
of tombstones.


Cheers,
Bowen

On 24/04/2024 17:25, Paul Chandler wrote:

Hi Bowen,

Thanks for your quick reply.

Sorry, I used the wrong term there; it is a maintenance window rather than 
an outage. This is a key system, and its vital nature means that the 
customer is rightly very risk-averse, so we will only ever get permission to 
upgrade one DC per night via a rolling upgrade, meaning this will always take 
more than a week.

So we can’t shorten the time the cluster is in mixed mode, but I am concerned 
about having a schema mismatch for such a long time. Should I be concerned, or 
have others upgraded in a similar way?

Thanks

Paul


On 24 Apr 2024, at 17:02, Bowen Song via user  wrote:

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, which 
is the preferred route. It isn't advisable to take down an entire DC to do an 
upgrade.

You should aim to complete upgrading the entire cluster and finish a full 
repair within the shortest gc_grace_seconds (default to 10 days) of all tables. 
Failing to do that may cause data resurrections.

During the rolling upgrade, you should not run repair or any DDL query (such as 
ALTER TABLE, TRUNCATE, etc.).

You don't need to do the rolling upgrade node by node. You can do it rack by 
rack. Stopping all nodes in a single rack and upgrading them concurrently is much 
faster. The number of nodes doesn't matter that much to the time required to 
complete a rolling upgrade; it's the number of DCs and racks that matters.

Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades, we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or two.

We have noticed during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added into the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea; are 
there any other issues to look out for?

Thanks

Paul Chandler


Re: Mixed Cluster 4.0 and 4.1

2024-04-24 Thread Bowen Song via user

Hi Paul,

You don't need to plan for or introduce an outage for a rolling upgrade, 
which is the preferred route. It isn't advisable to take down an entire 
DC to do an upgrade.


You should aim to complete upgrading the entire cluster and finish a 
full repair within the shortest gc_grace_seconds (default to 10 days) of 
all tables. Failing to do that may cause data resurrections.


During the rolling upgrade, you should not run repair or any DDL query 
(such as ALTER TABLE, TRUNCATE, etc.).


You don't need to do the rolling upgrade node by node. You can do it 
rack by rack. Stopping all nodes in a single rack and upgrading them 
concurrently is much faster. The number of nodes doesn't matter that 
much to the time required to complete a rolling upgrade; it's the number 
of DCs and racks that matters.


Cheers,
Bowen

On 24/04/2024 16:16, Paul Chandler wrote:

Hi all,

We have some large clusters ( 1000+  nodes ), these are across multiple 
datacenters.

When we perform upgrades, we would normally upgrade a DC at a time during a 
planned outage for one DC. This means that a cluster might be in a mixed mode 
with multiple versions for a week or two.

We have noticed during our testing that upgrading to 4.1 causes a schema 
mismatch due to the new tables added into the system keyspace.

Is this going to be an issue if this schema mismatch lasts for maybe several 
weeks? I assume that running any DDL during that time would be a bad idea; are 
there any other issues to look out for?

Thanks

Paul Chandler


Re: Trouble with using group commitlog_sync

2024-04-24 Thread Bowen Song via user
You might have run into the bottleneck of the driver's IO thread. Try 
increasing the driver's connections-per-server limit to 2 or 3 if you've 
only got 1 server in the cluster. Alternatively, run two client 
processes in parallel.
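
For reference, a minimal sketch (assuming the 4.x Java driver) of raising the
per-node connection pool size programmatically; the same setting is
advanced.connection.pool.local.size in application.conf, and the value 2 here
is just an example:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.config.DefaultDriverOption;
    import com.datastax.oss.driver.api.core.config.DriverConfigLoader;

    public class PoolSizeExample {
        public static void main(String[] args) {
            // Raise the per-node connection pool from the default of 1 to 2,
            // so more than one connection carries requests to each server.
            DriverConfigLoader loader = DriverConfigLoader.programmaticBuilder()
                    .withInt(DefaultDriverOption.CONNECTION_POOL_LOCAL_SIZE, 2)
                    .build();

            try (CqlSession session = CqlSession.builder()
                    .withConfigLoader(loader)
                    .build()) {
                System.out.println("Connected as " + session.getName());
            }
        }
    }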



On 24/04/2024 07:19, Nathan Marz wrote:
Tried it again with one more client thread, and that had no effect on 
performance. This is unsurprising as there are only 2 CPUs on this node 
and they were already at 100%. These were good ideas, but I'm still 
unable to even match the performance of batch commit mode with group 
commit mode.


On Tue, Apr 23, 2024 at 12:46 PM Bowen Song via user 
 wrote:


To achieve 10k loop iterations per second, each iteration must
take 0.1 milliseconds or less. Considering that each iteration
needs to lock and unlock the semaphore (two syscalls) and make
network requests (more syscalls), that's a lot of context
switches. It may be a bit too much to ask of a single thread. I
would suggest trying multi-threading or multi-processing, and seeing if
the combined insert rate is higher.

I should also note that executeAsync() has implicit limits on
the number of in-flight requests, which default to 1024 requests
per connection and 1 connection per server. See
https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/


On 23/04/2024 23:18, Nathan Marz wrote:

It's using the async API, so why would it need multiple threads?
Using the exact same approach I'm able to get 38k / second with
periodic commitlog_sync. For what it's worth, I do see 100% CPU
utilization in every single one of these tests.

On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user
 wrote:

Have you checked the thread CPU utilisation of the client
side? You likely will need more than one thread to do
insertion in a loop to achieve tens of thousands of inserts
per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual inserts
at a time with the same loop + semaphore approach. This only
nets 9k / second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more
individual inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch
statements should only be used where atomicity or
isolation is needed. Using batch statements won't make
inserting multiple partitions faster. In fact, it often
will make that slower.

Also, the linear relationship between
commitlog_sync_group_window and write throughput is
expected. That's because the max number of uncompleted
writes is limited by the write concurrency, and a write
is not considered "complete" before it is synced to disk
when commitlog sync is in group or batch mode. That
means within each interval, only a limited number of
writes can be done. The ways to increase that include:
adding more nodes, syncing the commitlog at shorter intervals, and
allowing more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a
single execute of a BatchStatement containing 100
inserts to succeed. However, the throughput I'm seeing
is atrocious.

With these settings, I'm executing 10 BatchStatements
concurrently using the semaphore + loop
approach I showed in my first message. So as requests
complete, more are sent out such that there are 10
in-flight at a time. Each BatchStatement has 100
individual inserts. I'm seeing only 730 inserts /
second. Again, with periodic mode I see 38k / second
and with batch I see 14k / second. My expectation was
that group commit mode throughput would be somewhere
between those two.

If I set commitlog_sync_group_window to 100ms, the
throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the
throughput increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the
throughput increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the
throughput increases to 13k / second, which is slightly
less than batch commit mode.

Is group commit mode supposed to have better
performance than batch mode?


On Tue, Apr 23, 2024 

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
To achieve 10k loop iterations per second, each iteration must take 0.1 
milliseconds or less. Considering that each iteration needs to lock and 
unlock the semaphore (two syscalls) and make network requests (more 
syscalls), that's a lot of context switches. It may be a bit too much to 
ask of a single thread. I would suggest trying multi-threading or 
multi-processing, and seeing if the combined insert rate is higher.


I should also note that executeAsync() has implicit limits on the 
number of in-flight requests, which default to 1024 requests per 
connection and 1 connection per server. See 
https://docs.datastax.com/en/developer/java-driver/4.17/manual/core/pooling/
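
To make the multi-threading suggestion concrete, here is a rough sketch
(assuming the 4.x Java driver) of running the same semaphore + executeAsync
loop in several threads; the keyspace/table, column names, thread count and
permit count below are placeholders, not values from this thread:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import java.util.UUID;
    import java.util.concurrent.Semaphore;

    public class MultiThreadedInserter {
        public static void main(String[] args) throws InterruptedException {
            int threads = 4;              // placeholder, tune per client CPU count
            int inFlightPerThread = 256;  // placeholder, stays under the 1024/connection limit

            try (CqlSession session = CqlSession.builder().build()) {
                PreparedStatement insert = session.prepare(
                        "INSERT INTO ks.tbl (a, b, c) VALUES (?, ?, ?)"); // placeholder table

                Runnable worker = () -> {
                    // Each thread runs its own bounded async insert loop.
                    Semaphore sem = new Semaphore(inFlightPerThread);
                    while (true) {
                        try {
                            sem.acquire();
                        } catch (InterruptedException e) {
                            return;
                        }
                        session.executeAsync(insert.bind(
                                        UUID.randomUUID().toString(),
                                        UUID.randomUUID().toString(),
                                        UUID.randomUUID().toString()))
                                .whenComplete((rs, err) -> sem.release());
                    }
                };

                Thread[] pool = new Thread[threads];
                for (int i = 0; i < threads; i++) {
                    pool[i] = new Thread(worker, "inserter-" + i);
                    pool[i].start();
                }
                for (Thread t : pool) {
                    t.join();
                }
            }
        }
    }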



On 23/04/2024 23:18, Nathan Marz wrote:
It's using the async API, so why would it need multiple threads? Using 
the exact same approach I'm able to get 38k / second with periodic 
commitlog_sync. For what it's worth, I do see 100% CPU utilization in 
every single one of these tests.


On Tue, Apr 23, 2024 at 11:01 AM Bowen Song via user 
 wrote:


Have you checked the thread CPU utilisation of the client side?
You likely will need more than one thread to do insertion in a
loop to achieve tens of thousands of inserts per second.


On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms,
concurrent_writes at 512, and doing 1000 individual inserts at a
time with the same loop + semaphore approach. This only nets 9k /
second.

I got much higher throughput for the other modes with
BatchStatement of 100 inserts rather than 100x more individual
inserts.

On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user
 wrote:

I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed.
Using batch statements won't make inserting multiple
partitions faster. In fact, it often will make that slower.

Also, the linear relationship between
commitlog_sync_group_window and write throughput is expected.
That's because the max number of uncompleted writes is
limited by the write concurrency, and a write is not
considered "complete" before it is synced to disk when
commitlog sync is in group or batch mode. That means within
each interval, only a limited number of writes can be done. The
ways to increase that include: adding more nodes, syncing the
commitlog at shorter intervals, and allowing more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a
single execute of a BatchStatement containing 100 inserts to
succeed. However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatements
concurrently using the semaphore + loop approach I
showed in my first message. So as requests complete, more
are sent out such that there are 10 in-flight at a time.
Each BatchStatement has 100 individual inserts. I'm seeing
only 730 inserts / second. Again, with periodic mode I see
38k / second and with batch I see 14k / second. My
expectation was that group commit mode throughput would be
somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the
throughput drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance
than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reducing it if you are using SSD-backed storage
for the commit log. 10-15 ms is a good starting point.
You may also want to increase the value of
concurrent_writes; consider at least doubling or quadrupling
it from the default. You'll need even higher write
concurrency for a longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with
"group" mode. The only config for that is
"commitlog_sync_group_window", and I have that set to
the default 1000ms.

    On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set
commitlog

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
Have you checked the thread CPU utilisation of the client side? You 
likely will need more than one thread to do insertion in a loop to 
achieve tens of thousands of inserts per second.



On 23/04/2024 21:55, Nathan Marz wrote:

Thanks for the explanation.

I tried again with commitlog_sync_group_window at 2ms, 
concurrent_writes at 512, and doing 1000 individual inserts at a time 
with the same loop + semaphore approach. This only nets 9k / second.


I got much higher throughput for the other modes with BatchStatement 
of 100 inserts rather than 100x more individual inserts.


On Tue, Apr 23, 2024 at 10:45 AM Bowen Song via user 
 wrote:


I suspect you are abusing batch statements. Batch statements
should only be used where atomicity or isolation is needed. Using
batch statements won't make inserting multiple partitions faster.
In fact, it often will make that slower.

Also, the linear relationship between commitlog_sync_group_window
and write throughput is expected. That's because the max number of
uncompleted writes is limited by the write concurrency, and a
write is not considered "complete" before it is synced to disk
when commitlog sync is in group or batch mode. That means within
each interval, only a limited number of writes can be done. The ways
to increase that include: adding more nodes, syncing the commitlog at
shorter intervals, and allowing more concurrent writes.


On 23/04/2024 20:43, Nathan Marz wrote:

Thanks. I raised concurrent_writes to 128 and
set commitlog_sync_group_window to 20ms. This causes a single
execute of a BatchStatement containing 100 inserts to succeed.
However, the throughput I'm seeing is atrocious.

With these settings, I'm executing 10 BatchStatements concurrently
using the semaphore + loop approach I showed in my
first message. So as requests complete, more are sent out such
that there are 10 in-flight at a time. Each BatchStatement has
100 individual inserts. I'm seeing only 730 inserts / second.
Again, with periodic mode I see 38k / second and with batch I see
14k / second. My expectation was that group commit mode
throughput would be somewhere between those two.

If I set commitlog_sync_group_window to 100ms, the throughput
drops to 14 / second.

If I set commitlog_sync_group_window to 10ms, the throughput
increases to 1587 / second.

If I set commitlog_sync_group_window to 5ms, the throughput
increases to 3200 / second.

If I set commitlog_sync_group_window to 1ms, the throughput
increases to 13k / second, which is slightly less than batch
commit mode.

Is group commit mode supposed to have better performance than
batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user
 wrote:

The default commitlog_sync_group_window is very long for
SSDs. Try reducing it if you are using SSD-backed storage for
the commit log. 10-15 ms is a good starting point. You may
also want to increase the value of concurrent_writes;
consider at least doubling or quadrupling it from the default.
You'll need even higher write concurrency for a longer
commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group"
mode. The only config for that is
"commitlog_sync_group_window", and I have that set to the
default 1000ms.

    On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to
1 second long when commitlog_sync is set to batch mode?
The documentation

<https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html>
on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while
waiting. You may need to increase concurrent_writes
for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The
default is 2 milliseconds.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or
batch commitlog_sync options, but I'm having tons of
issues when I change it to "group". I have
"commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while(true) {

sem.acquire();
session.executeAsync(insert.bind(genUUIDStr(),
genUUIDStr(), genUUIDStr(

Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
I suspect you are abusing batch statements. Batch statements should only 
be used where atomicity or isolation is needed. Using batch statements 
won't make inserting multiple partitions faster. In fact, it often will 
make that slower.


Also, the linear relationship between commitlog_sync_group_window and 
write throughput is expected. That's because the max number of 
uncompleted writes is limited by the write concurrency, and a write is 
not considered "complete" before it is synced to disk when commitlog 
sync is in group or batch mode. That means within each interval, only 
a limited number of writes can be done. The ways to increase that 
include: adding more nodes, syncing the commitlog at shorter intervals, 
and allowing more concurrent writes.



On 23/04/2024 20:43, Nathan Marz wrote:
Thanks. I raised concurrent_writes to 128 and 
set commitlog_sync_group_window to 20ms. This causes a single execute 
of a BatchStatement containing 100 inserts to succeed. However, the 
throughput I'm seeing is atrocious.


With these settings, I'm executing 10 BatchStatements concurrently 
using the semaphore + loop approach I showed in my first message. 
So as requests complete, more are sent out such that there are 10 
in-flight at a time. Each BatchStatement has 100 individual inserts. 
I'm seeing only 730 inserts / second. Again, with periodic mode I see 
38k / second and with batch I see 14k / second. My expectation was 
that group commit mode throughput would be somewhere between those two.


If I set commitlog_sync_group_window to 100ms, the throughput drops to 
14 / second.


If I set commitlog_sync_group_window to 10ms, the throughput increases 
to 1587 / second.


If I set commitlog_sync_group_window to 5ms, the throughput increases 
to 3200 / second.


If I set commitlog_sync_group_window to 1ms, the throughput increases 
to 13k / second, which is slightly less than batch commit mode.


Is group commit mode supposed to have better performance than batch mode?


On Tue, Apr 23, 2024 at 8:46 AM Bowen Song via user 
 wrote:


The default commitlog_sync_group_window is very long for SSDs. Try
reducing it if you are using SSD-backed storage for the commit log.
10-15 ms is a good starting point. You may also want to increase
the value of concurrent_writes; consider at least doubling or
quadrupling it from the default. You'll need even higher write
concurrency for a longer commitlog_sync_group_window.


On 23/04/2024 19:26, Nathan Marz wrote:

"batch" mode works fine. I'm having trouble with "group" mode.
The only config for that is "commitlog_sync_group_window", and I
have that set to the default 1000ms.

    On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user
 wrote:

Why would you want to set commitlog_sync_batch_window to 1
second long when commitlog_sync is set to batch mode? The
documentation

<https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html>
on this says:

/This window should be kept short because the writer
threads will be unable to do extra work while waiting.
You may need to increase concurrent_writes for the same
reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default
is 2 milliseconds.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single
m6gd.large instance. It works fine with periodic or batch
commitlog_sync options, but I'm having tons of issues when I
change it to "group". I have "commitlog_sync_group_window"
set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while (true) {
    sem.acquire();
    session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
           .whenComplete((t, u) -> sem.release());
}

If I set numTickets higher than 20, I get tons of timeout
errors.

I've also tried doing single commands with BatchStatement
with many inserts at a time, and that fails with timeout
when the batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes
it time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and
with batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group
commitlog_sync to work properly? I didn't expect to have to
do anything other than change the config.


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
The default commitlog_sync_group_window is very long for SSDs. Try 
reducing it if you are using SSD-backed storage for the commit log. 10-15 
ms is a good starting point. You may also want to increase the value of 
concurrent_writes; consider at least doubling or quadrupling it from the 
default. You'll need even higher write concurrency for a longer 
commitlog_sync_group_window.



On 23/04/2024 19:26, Nathan Marz wrote:
"batch" mode works fine. I'm having trouble with "group" mode. The 
only config for that is "commitlog_sync_group_window", and I have that 
set to the default 1000ms.


On Tue, Apr 23, 2024 at 8:15 AM Bowen Song via user 
 wrote:


Why would you want to set commitlog_sync_batch_window to 1 second
long when commitlog_sync is set to batch mode? The documentation

<https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html>
on this says:

/This window should be kept short because the writer threads
will be unable to do extra work while waiting. You may need to
increase concurrent_writes for the same reason/

If you want to use batch mode, at least ensure
commitlog_sync_batch_window is reasonably short. The default is 2
milliseconds.


On 23/04/2024 18:32, Nathan Marz wrote:

I'm doing some benchmarking of Cassandra on a single m6gd.large
instance. It works fine with periodic or batch commitlog_sync
options, but I'm having tons of issues when I change it to
"group". I have "commitlog_sync_group_window" set to 1000ms.

My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while (true) {
    sem.acquire();
    session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
           .whenComplete((t, u) -> sem.release());
}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with
many inserts at a time, and that fails with timeout when the
batch size gets more than 20.

Increasing the write request timeout in cassandra.yaml makes it
time out at slightly higher numbers of concurrent requests.

With periodic I'm able to get about 38k writes / second, and with
batch I'm able to get about 14k / second.

Any tips on what I should be doing to get group commitlog_sync to
work properly? I didn't expect to have to do anything other than
change the config.


Re: Trouble with using group commitlog_sync

2024-04-23 Thread Bowen Song via user
Why would you want to set commitlog_sync_batch_window to 1 second long 
when commitlog_sync is set to batch mode? The documentation 
<https://cassandra.apache.org/doc/stable/cassandra/architecture/storage_engine.html> 
on this says:


   /This window should be kept short because the writer threads will be
   unable to do extra work while waiting. You may need to increase
   concurrent_writes for the same reason/

If you want to use batch mode, at least ensure 
commitlog_sync_batch_window is reasonably short. The default is 2 
milliseconds.



On 23/04/2024 18:32, Nathan Marz wrote:
I'm doing some benchmarking of Cassandra on a single m6gd.large 
instance. It works fine with periodic or batch commitlog_sync options, 
but I'm having tons of issues when I change it to "group". I have 
"commitlog_sync_group_window" set to 1000ms.


My client is doing writes like this (pseudocode):

Semaphore sem = new Semaphore(numTickets);
while (true) {
    sem.acquire();
    session.executeAsync(insert.bind(genUUIDStr(), genUUIDStr(), genUUIDStr()))
           .whenComplete((t, u) -> sem.release());
}

If I set numTickets higher than 20, I get tons of timeout errors.

I've also tried doing single commands with BatchStatement with many 
inserts at a time, and that fails with timeout when the batch size 
gets more than 20.


Increasing the write request timeout in cassandra.yaml makes it time 
out at slightly higher numbers of concurrent requests.


With periodic I'm able to get about 38k writes / second, and with 
batch I'm able to get about 14k / second.


Any tips on what I should be doing to get group commitlog_sync to work 
properly? I didn't expect to have to do anything other than change the 
config.

Re: Alternate apt repo for Debian installation?

2024-03-20 Thread Bowen Song via user

You can try https://archive.apache.org/dist/cassandra/debian/

The deb files can be found here: 
https://archive.apache.org/dist/cassandra/debian/pool/main/c/cassandra/


On 20/03/2024 20:47, Grant Talarico wrote:
Hi there. Hopefully this is the right place to ask this question. I'm 
trying to install the latest version of Cassandra 3.11 using debian 
packages through the debian.cassandra.apache.org 
 apt repo but it appears to be 
down at the moment. Is there an alternate apt repo I might be able to 
use as a backup?


- Grant


Re: [EXTERNAL] Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Bowen Song via user

Short answer:

There's no definite answer to that question.


Longer answer:

I doubt such a date has already been decided. It's largely driven by the 
time required to fix known issues and any potential new issues 
discovered during the BETA and RC process. If you want to track the 
progress, feel free to look at the project's Jira boards; there's a 5.0 
GA board dedicated to that.


Furthermore, it's likely there will only be experimental support for 
Java 17 in Cassandra 5.0, which means it shouldn't be used in production 
environments.


So, would you like to keep waiting indefinitely for official Java 17 
support, or run Cassandra 4.1 on Java 11 today and upgrade when a newer 
version becomes available?



On 18/03/2024 13:10, Divyanshi Kaushik via user wrote:

Thanks for your reply.

As Cassandra has moved to Java 17 in its *5.0-BETA1* (latest release 
on 2023-12-05), can you please let us know when the team is planning 
to GA the Cassandra 5.0 version which has Java 17 support?


Regards,
Divyanshi

*From:* Bowen Song via user 
*Sent:* Monday, March 18, 2024 5:14 PM
*To:* user@cassandra.apache.org 
*Cc:* Bowen Song 
*Subject:* [EXTERNAL] Re: About Cassandra stable version having Java 
17 support




Why Java 17? It makes no sense to choose an officially non-supported 
library version for a piece of software. That decision making process 
is the problem, not the software's library version compatibility.



On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:

Hi All,

As per my project requirement, Java 17 needs to be used. Can you
please let us know when you are planning to release the next
stable version of Cassandra having Java 17 support?

Regards,
Divyanshi

Re: About Cassandra stable version having Java 17 support

2024-03-18 Thread Bowen Song via user
Why Java 17? It makes no sense to choose an officially non-supported 
library version for a piece of software. That decision making process is 
the problem, not the software's library version compatibility.



On 18/03/2024 09:44, Divyanshi Kaushik via user wrote:

Hi All,

As per my project requirement, Java 17 needs to be used. Can you 
please let us know when you are planning to release the next stable 
version of Cassandra having Java 17 support?


Regards,
Divyanshi

Re: Best Practices for Managing Concurrent Client Connections in Cassandra

2024-02-29 Thread Bowen Song via user
They are suitable for production use for protecting your Cassandra 
server, not the clients. The clients will likely experience an error 
when the limit is reached, and they need to handle that error appropriately.


What you really want to do probably are:

1. Change the client's behaviour to limit the number of servers it 
connects to concurrently. The client can close connections not in use, 
and/or only connect to a subset of servers (note: this affects token-aware 
routing).


2. After making the above change, if the number of connections is still an 
issue, horizontally scale your Cassandra cluster to handle the peak 
number of connections. More nodes means fewer connections to each node.



On 29/02/2024 10:50, Naman kaushik wrote:


Hello Cassandra Community,

We've been experiencing occasional spikes in the number of client 
connections to our Cassandra cluster, particularly during high-volume 
API request periods. We're using persistent connections, and we've 
noticed that the number of connections can increase significantly 
during these spikes.


We're considering using the following Cassandra parameters to manage 
concurrent client connections:


*native_transport_max_concurrent_connections*: This parameter sets the 
maximum number of concurrent client connections allowed by the native 
transport protocol. Currently, it's set to -1, indicating no limit.


*native_transport_max_concurrent_connections_per_ip*: This parameter 
sets the maximum number of concurrent client connections allowed per 
source IP address. Like the previous parameter, it's also set to -1.


We're thinking of using these parameters to limit the maximum number 
of connections from a single IP address, especially to prevent 
overwhelming the database during spikes in API requests that should be 
handled by our SOA team exclusively.


Are these parameters suitable for production use, and would 
implementing restrictions on concurrent connections per IP be 
considered a good practice in managing Cassandra clusters?


Any insights or recommendations would be greatly appreciated.

Thank you!

Naman


Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)

2024-02-22 Thread Bowen Song via user
On the IO scheduler point, cfq WAS the only scheduler supporting IO 
priorities (such as ionice) shipped by default with the Linux kernel, 
but that has changed since bfq and mq-deadline were added to the Linux 
kernel. Both bfq and mq-deadline support IO priority, as documented 
here: https://docs.kernel.org/block/ioprio.html



On 22/02/2024 19:39, Dmitry Konstantinov wrote:

Hi all,

I was not involved in the changes, but I analyzed the question 
some time ago from another angle.
There were also changes related to the -XX:ThreadPriorityPolicy JVM 
option. When you set a thread priority for a Java thread, it does not 
mean it is always propagated as a native OS thread priority. To 
propagate the priority you should use the -XX:ThreadPriorityPolicy and 
-XX:JavaPriorityN_To_OSPriority JVM options, but there is an issue with 
them, because the JVM wants to be executed as root to 
set -XX:ThreadPriorityPolicy=1, which enables the priorities usage. A 
hack was invented a long time ago to work around it by setting 
-XX:ThreadPriorityPolicy=42 (any value not equal to 0 or 1) and 
bypass the not-so-needed and annoying grants validation logic (see 
http://tech.stolsvik.com/2010/01/linux-java-thread-priorities-workaround.html 
for more details).
It worked for Java 8, but then Java 9 added extra validation for JVM 
option values (JEP 245: Validate JVM Command-Line Flag Arguments - 
https://bugs.openjdk.org/browse/JDK-8059557), and the hack stopped 
working and started causing a JVM failure with a validation error. As a 
reaction to it, the flag was removed from the Cassandra JVM configuration 
files in https://issues.apache.org/jira/browse/CASSANDRA-13107. After 
that, the lower priority value for compaction threads has not had any 
actual effect.
The interesting part is that the JVM logic was changed to 
support setting -XX:ThreadPriorityPolicy=1 for non-root 
users in Java 13 (https://bugs.openjdk.org/browse/JDK-8215962) and the 
change was backported to Java 11 as well 
(https://bugs.openjdk.org/browse/JDK-8217494).
So, from this point of view, I think it would be nice to bring back 
the ability to set the thread priority for compaction threads. At the same 
time, I would not expect too much improvement from enabling it.


P.S. There was also an idea about using ionice 
(https://issues.apache.org/jira/browse/CASSANDRA-9946) but the current 
Linux IO schedulers do not take that into account anymore. It looks 
like the only scheduler that supported ionice was CFQ 
(https://issues.apache.org/jira/browse/CASSANDRA-9946?focusedCommentId=14648616=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14648616 
<https://issues.apache.org/jira/browse/CASSANDRA-9946?focusedCommentId=14648616=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-14648616>) 
and it was deprecated and removed in Linux kernel 5.x 
(https://github.com/torvalds/linux/commit/f382fb0bcef4c37dc049e9f6963e3baf204d815c).


Regards,
Dmitry


On Thu, 22 Feb 2024 at 15:30, Bowen Song via user 
 wrote:


Hi Pierre,

Is there anything stopping you from using the
compaction_throughput

<https://github.com/apache/cassandra/blob/f9e033f519c14596da4dc954875756a69aea4e78/conf/cassandra.yaml#L989>
option in the cassandra.yaml file to manage the performance impact
of compaction operations?

With thread priority, there's a failure scenario on busy nodes
when the read operations use too much CPU. If the compaction
thread has lower priority, it does not get enough CPU time to run,
and SSTable files will build up, causing reads to become slower and
more expensive, which in turn results in compaction getting even less
CPU time. In the end, one of the following three things will happen:

  * the node becomes too slow and most queries time out
  * the Java process crashes due to too many open files or OOM
because JVM GC can't keep up
  * the filesystem runs out of free space or inodes

However, I'm unsure whether the compaction thread priority was
intentionally removed from 4.1.0. Someone familiar with this
matter may be able to answer that.

Cheers,
Bowen


On 22/02/2024 13:26, Pierre Fersing wrote:


Hello all,

I've recently upgraded to Cassandra 4.1 and see a change in
compaction behavior that seems unwanted:

* With Cassandra 3.11 compaction was run by thread in low
priority and thus using CPU nice (visible using top) (I believe
Cassandra 4.0 also had this behavior)

* With Cassandra 4.1, compactions are no longer run as low
priority thread (compaction now use "normal" CPU).

This means that when the server had limited CPU, Cassandra
compaction now compete for the CPU with other process (probably
including Cassandra itself) that need CPU. When it was using CPU
nice, the compaction only competed for CPU with other lower
priority p

Re: Cassandra 4.1 compaction thread no longer low priority (cpu nice)

2024-02-22 Thread Bowen Song via user

Hi Pierre,

Is there anything stopping you from using the compaction_throughput 
<https://github.com/apache/cassandra/blob/f9e033f519c14596da4dc954875756a69aea4e78/conf/cassandra.yaml#L989> 
option in the cassandra.yaml file to manage the performance impact of 
compaction operations?


With thread priority, there's a failure scenario on busy nodes when the 
read operations use too much CPU. If the compaction thread has lower 
priority, it does not get enough CPU time to run, and SSTable files will 
build up, causing reads to become slower and more expensive, which in 
turn results in compaction getting even less CPU time. In the end, one of 
the following three things will happen:


 * the node becomes too slow and most queries time out
 * the Java process crashes due to too many open files or OOM because
   JVM GC can't keep up
 * the filesystem runs out of free space or inodes

However, I'm unsure whether the compaction thread priority was 
intentionally removed from 4.1.0. Someone familiar with this matter may 
be able to answer that.


Cheers,
Bowen


On 22/02/2024 13:26, Pierre Fersing wrote:


Hello all,

I've recently upgraded to Cassandra 4.1 and see a change in compaction 
behavior that seems unwanted:


* With Cassandra 3.11, compaction was run by threads at low priority and 
thus used CPU nice (visible using top). (I believe Cassandra 4.0 also 
had this behavior.)


* With Cassandra 4.1, compactions are no longer run as low-priority 
threads (compaction now uses "normal" CPU).


This means that when the server has limited CPU, Cassandra compaction 
now competes for the CPU with other processes (probably including 
Cassandra itself) that need CPU. When it was using CPU nice, the 
compaction only competed for CPU with other lower-priority processes, 
which was great as it left CPU available for processes that need to 
keep a small response time (like an API used by a human).


Was losing this feature in Cassandra 4.1 intended, or was it just an 
oversight during the rewrite of the compaction executor? Should I open a bug to 
re-introduce this feature in Cassandra?



I've done few searches, and:

* I believe compaction used CPU nice because the compactor executor 
was created with minimal priority: 
https://github.com/apache/cassandra/blob/cassandra-3.11.16/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L1906 



* I think it was dropped by commit 
https://github.com/apache/cassandra/commit/be1f050bc8c0cd695a42952e3fc84625ad48d83a 



* It looks doable to set the thread priority with the new executor; I 
think adding ".withThreadPriority(Thread.MIN_PRIORITY)" when using 
executorFactory in 
https://github.com/apache/cassandra/blob/77a3e0e818df3cce45a974ecc977ad61bdcace47/src/java/org/apache/cassandra/db/compaction/CompactionManager.java#L2028 
should 
do it.
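
For illustration of the general idea only (plain java.util.concurrent, not
Cassandra's executorFactory API): a thread factory that creates
minimum-priority worker threads could look like the sketch below. Whether the
priority actually reaches the OS still depends on the JVM flags discussed
earlier in this thread.

    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ThreadFactory;

    public class LowPriorityExecutorExample {
        public static void main(String[] args) {
            // Hypothetical illustration: create worker threads at minimum priority,
            // similar in spirit to the old low-priority compaction executor.
            ThreadFactory lowPriorityFactory = runnable -> {
                Thread t = new Thread(runnable, "low-priority-worker");
                t.setDaemon(true);
                t.setPriority(Thread.MIN_PRIORITY); // maps to OS 'nice' only if the JVM propagates priorities
                return t;
            };

            ExecutorService executor = Executors.newFixedThreadPool(2, lowPriorityFactory);
            executor.submit(() -> System.out.println(
                    "running at priority " + Thread.currentThread().getPriority()));
            executor.shutdown();
        }
    }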



Did I miss a reason to no longer use low-priority threads for 
compaction? Should I open a bug for re-adding this feature / submit a 
PR?


Regards,

Pierre Fersing


Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-19 Thread Bowen Song via user
You can have a read at 
https://www.datastax.com/blog/cassandra-anti-patterns-queues-and-queue-datasets


Your table schema does not include the most important piece of 
information - the partition keys (and clustering keys, if any). Keep in 
mind that you can only efficiently query Cassandra by the exact 
partition key or the token of a partition key, otherwise you will have 
to rely on MV or secondary index, or worse, scan the entire table (all 
the nodes) to find your data.


A Cassandra schema should look like this:

CREATE TABLE xyz (
  a text,
  b text,
  c timeuuid,
  d int,
  e text,
  PRIMARY KEY ((a, b), c, d)
);

The line "PRIMARY KEY" contains arguably the most important piece of 
information of the table schema.
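
For illustration, a read that restricts the full partition key (a, b) of the
example table above can be routed straight to the owning replicas. A rough
sketch with the 4.x Java driver, where the keyspace "ks" and the bound values
are placeholders:

    import com.datastax.oss.driver.api.core.CqlSession;
    import com.datastax.oss.driver.api.core.cql.PreparedStatement;
    import com.datastax.oss.driver.api.core.cql.ResultSet;

    public class PartitionKeyQueryExample {
        public static void main(String[] args) {
            try (CqlSession session = CqlSession.builder()
                    .withKeyspace("ks") // placeholder keyspace containing table xyz
                    .build()) {
                // Efficient: both partition key columns are restricted, so the read
                // targets a single partition on the replicas that own it.
                PreparedStatement byPartition = session.prepare(
                        "SELECT c, d, e FROM xyz WHERE a = ? AND b = ?");
                ResultSet rows = session.execute(byPartition.bind("some-a", "some-b"));
                rows.forEach(row -> System.out.println(row.getFormattedContents()));
            }
        }
    }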



On 19/02/2024 06:52, Gowtham S wrote:

Hi Bowen

which is a well documented anti-pattern.

Can you please explain more about this? I'm not aware of it. It will be 
helpful for making decisions.

Please find the below table schema

*Table schema*
TopicName - text
Partition - int
MessageUUID - text
Actual data - text
OccurredTime - Timestamp
Status - boolean

We are planning to read the table by topic name where the status is 
not true, and produce those messages to the respective topic when Kafka is live.


Thanks and regards,
Gowtham S


On Sat, 17 Feb 2024 at 18:10, Bowen Song via user 
 wrote:


Hi Gowtham,

On the face of it, it sounds like you are planning to use
Cassandra for a queue-like application, which is a well documented
anti-pattern. If that's not the case, can you please show the
table schema and some example queries?

Cheers,
Bowen

On 17/02/2024 08:44, Gowtham S wrote:


Dear Cassandra Community,

I am reaching out to seek your valuable feedback and insights on
a proposed solution we are considering for managing Kafka outages
using Cassandra.

At our organization, we heavily rely on Kafka for real-time data
processing and messaging. However, like any technology, Kafka is
susceptible to occasional outages which can disrupt our
operations and impact our services. To mitigate the impact of
such outages and ensure continuity, we are exploring the
possibility of leveraging Cassandra as a backup solution.

Our proposed approach involves storing messages in Cassandra
during Kafka outages. Subsequently, we plan to implement a
scheduler that will read from Cassandra and attempt to write
these messages back into Kafka once it is operational again.

We believe that by adopting this strategy, we can achieve the
following benefits:

1.

Improved Fault Tolerance: By having a backup mechanism in
place, we can reduce the risk of data loss and ensure
continuity of operations during Kafka outages.

2.

Enhanced Reliability: Cassandra's distributed architecture
and built-in replication features make it well-suited for
storing data reliably, even in the face of failures.

3.

Scalability: Both Cassandra and Kafka are designed to scale
horizontally, allowing us to handle increased loads seamlessly.

Before proceeding further with this approach, we would greatly
appreciate any feedback, suggestions, or concerns from the
community. Specifically, we are interested in hearing about:

  * Potential challenges or drawbacks of using Cassandra as a
backup solution for Kafka outages.
  * Best practices or recommendations for implementing such a
backup mechanism effectively.
  * Any alternative approaches or technologies that we should
consider?

Your expertise and insights are invaluable to us, and we are
eager to learn from your experiences and perspectives. Please
feel free to share your thoughts or reach out to us with any
questions or clarifications.

Thank you for taking the time to consider our proposal, and we
look forward to hearing from you soon.

Thanks and regards,
Gowtham S


Re: Requesting Feedback for Cassandra as a backup solution.

2024-02-17 Thread Bowen Song via user

Hi Gowtham,

On the face of it, it sounds like you are planning to use Cassandra for 
a queue-like application, which is a well documented anti-pattern. If 
that's not the case, can you please show the table schema and some 
example queries?


Cheers,
Bowen

On 17/02/2024 08:44, Gowtham S wrote:


Dear Cassandra Community,

I am reaching out to seek your valuable feedback and insights on a 
proposed solution we are considering for managing Kafka outages using 
Cassandra.


At our organization, we heavily rely on Kafka for real-time data 
processing and messaging. However, like any technology, Kafka is 
susceptible to occasional outages which can disrupt our operations and 
impact our services. To mitigate the impact of such outages and ensure 
continuity, we are exploring the possibility of leveraging Cassandra 
as a backup solution.


Our proposed approach involves storing messages in Cassandra during 
Kafka outages. Subsequently, we plan to implement a scheduler that 
will read from Cassandra and attempt to write these messages back into 
Kafka once it is operational again.


We believe that by adopting this strategy, we can achieve the 
following benefits:


1.

Improved Fault Tolerance: By having a backup mechanism in place,
we can reduce the risk of data loss and ensure continuity of
operations during Kafka outages.

2.

Enhanced Reliability: Cassandra's distributed architecture and
built-in replication features make it well-suited for storing data
reliably, even in the face of failures.

3.

Scalability: Both Cassandra and Kafka are designed to scale
horizontally, allowing us to handle increased loads seamlessly.

Before proceeding further with this approach, we would greatly 
appreciate any feedback, suggestions, or concerns from the community. 
Specifically, we are interested in hearing about:


  * Potential challenges or drawbacks of using Cassandra as a backup
solution for Kafka outages.
  * Best practices or recommendations for implementing such a backup
mechanism effectively.
  * Any alternative approaches or technologies that we should consider?

Your expertise and insights are invaluable to us, and we are eager to 
learn from your experiences and perspectives. Please feel free to 
share your thoughts or reach out to us with any questions or 
clarifications.


Thank you for taking the time to consider our proposal, and we look 
forward to hearing from you soon.


Thanks and regards,
Gowtham S

Re: Switching to Incremental Repair

2024-02-15 Thread Bowen Song via user
The gc_grace_seconds, which defaults to 10 days, is the maximum safe 
interval between repairs. How much data gets written during that period 
of time? Will your nodes run out of disk space because of the new data 
written during that time? If so, it sounds like your nodes are 
dangerously close to running out of disk space, and you should address 
that issue first before even considering upgrading Cassandra.


On 15/02/2024 18:49, Kristijonas Zalys wrote:

Hi folks,

One last question regarding incremental repair.

What would be a safe approach to temporarily stop running incremental 
repair on a cluster (e.g.: during a Cassandra major version upgrade)? 
My understanding is that if we simply stop running incremental repair, 
the cluster's nodes can, in the worst case, double in disk size as the 
repaired dataset will not get compacted with the unrepaired dataset. 
Similar to Sebastian, we have nodes where the disk usage is multiple 
TiBs so significant growth can be quite dangerous in our case. Would 
the only safe choice be to mark all SSTables as unrepaired before 
stopping regular incremental repair?


Thanks,
Kristijonas


On Wed, Feb 7, 2024 at 4:33 PM Bowen Song via user 
 wrote:


The over-streaming is only problematic for the repaired SSTables,
but it
can be triggered by inconsistencies within the unrepaired SSTables
during an incremental repair session. This is because although an
incremental repair will only compare the unrepaired SSTables, it
will stream both the unrepaired and repaired SSTables for the
inconsistent token ranges. Keep in mind that the source SSTables for
streaming are selected based on the token ranges, not the
repaired/unrepaired state.

Based on the above, I'm unsure whether running an incremental repair before a
full repair can fully avoid the over-streaming issue.

On 07/02/2024 22:41, Sebastian Marsching wrote:
> Thank you very much for your explanation.
>
> Streaming happens on the token range level, not the SSTable
level, right? So, when running an incremental repair before the
full repair, the problem that “some unrepaired SSTables are being
marked as repaired on one node but not on another” should not
exist any longer. Now this data should be marked as repaired on
all nodes.
>
> Thus, when repairing the SSTables that are marked as repaired,
this data should be included on all nodes when calculating the
Merkle trees and no overstreaming should happen.
>
> Of course, this means that running an incremental repair *first*
after marking SSTables as repaired and only running the full
repair *after* that is critical. I have to admit that previously I
wasn’t fully aware of how critical this step is.
>
>> On 07.02.2024 at 20:22, Bowen Song via user wrote:
>>
>> Unfortunately repair doesn't compare each partition
individually. Instead, it groups multiple partitions together and
calculates a hash of them, stores the hash in a leaf of a merkle
tree, and then compares the merkle trees between replicas during a
repair session. If any one of the partitions covered by a leaf is
inconsistent between replicas, the hash values in these leaves
will be different, and all partitions covered by the same leaf
will need to be streamed in full.
>>
>> Knowing that, and also knowing that your approach can create a
lot of inconsistencies in the repaired SSTables because some
unrepaired SSTables are being marked as repaired on one node but
not on another, you would then understand why over-streaming can
happen. The over-streaming is only problematic for the repaired
SSTables, because they are much bigger than the unrepaired.
>>
>>
>> On 07/02/2024 17:00, Sebastian Marsching wrote:
>>>> Caution, using the method you described, the amount of data
streamed at the end with the full repair is not the amount of data
written between stopping the first node and the last node, but
depends on the table size, the number of partitions written, their
distribution in the ring and the 'repair_session_space' value. If
the table is large, the writes touch a large number of partitions
scattered across the token ring, and the value of
'repair_session_space' is small, you may end up with a very
expensive over-streaming.
>>> Thanks for the warning. In our case it worked well (obviously
we tested it on a test cluster before applying it on the
production clusters), but it is good to know that this might not
always be the case.
>>>
>>> Maybe I misunderstand how full and incremental repairs work in
C* 4.x. I would appreciate if you could clarify this for me.
>>>
>>> So far, I assumed that a full repair on a cluster that is also

Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
The over-streaming is only problematic for the repaired SSTables, but it 
can be triggered by inconsistencies within the unrepaired SSTables 
during an incremental repair session. This is because although an 
incremental repair only compares the unrepaired SSTables, it 
will stream both the unrepaired and repaired SSTables for the 
inconsistent token ranges. Keep in mind that the source SSTables for 
streaming are selected based on the token ranges, not the 
repaired/unrepaired state.


Based on the above, I'm unsure whether running an incremental repair before a 
full repair can fully avoid the over-streaming issue.


On 07/02/2024 22:41, Sebastian Marsching wrote:

Thank you very much for your explanation.

Streaming happens on the token range level, not the SSTable level, right? So, 
when running an incremental repair before the full repair, the problem that 
“some unrepaired SSTables are being marked as repaired on one node but not on 
another” should not exist any longer. Now this data should be marked as 
repaired on all nodes.

Thus, when repairing the SSTables that are marked as repaired, this data should 
be included on all nodes when calculating the Merkle trees and no overstreaming 
should happen.

Of course, this means that running an incremental repair *first* after marking 
SSTables as repaired and only running the full repair *after* that is critical. 
I have to admit that previously I wasn’t fully aware of how critical this step 
is.


Am 07.02.2024 um 20:22 schrieb Bowen Song via user :

Unfortunately repair doesn't compare each partition individually. Instead, it 
groups multiple partitions together and calculates a hash of them, stores the 
hash in a leaf of a merkle tree, and then compares the merkle trees between 
replicas during a repair session. If any one of the partitions covered by a 
leaf is inconsistent between replicas, the hash values in these leaves will be 
different, and all partitions covered by the same leaf will need to be streamed 
in full.

Knowing that, and also knowing that your approach can create a lot of 
inconsistencies in the repaired SSTables because some unrepaired SSTables are 
being marked as repaired on one node but not on another, you would then 
understand why over-streaming can happen. The over-streaming is only 
problematic for the repaired SSTables, because they are much bigger than the 
unrepaired.


On 07/02/2024 17:00, Sebastian Marsching wrote:

Caution, using the method you described, the amount of data streamed at the end 
with the full repair is not the amount of data written between stopping the 
first node and the last node, but depends on the table size, the number of 
partitions written, their distribution in the ring and the 
'repair_session_space' value. If the table is large, the writes touch a large 
number of partitions scattered across the token ring, and the value of 
'repair_session_space' is small, you may end up with a very expensive 
over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
und unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

 From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking any data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.



Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Unfortunately repair doesn't compare each partition individually. 
Instead, it groups multiple partitions together and calculates a hash of 
them, stores the hash in a leaf of a merkle tree, and then compares the 
merkle trees between replicas during a repair session. If any one of the 
partitions covered by a leaf is inconsistent between replicas, the hash 
values in these leaves will be different, and all partitions covered by 
the same leaf will need to be streamed in full.


Knowing that, and also knowing that your approach can create a lot of 
inconsistencies in the repaired SSTables because some unrepaired 
SSTables are being marked as repaired on one node but not on another, 
you would then understand why over-streaming can happen. The 
over-streaming is only problematic for the repaired SSTables, because 
they are much bigger than the unrepaired.



On 07/02/2024 17:00, Sebastian Marsching wrote:

Caution, using the method you described, the amount of data streamed at the end 
with the full repair is not the amount of data written between stopping the 
first node and the last node, but depends on the table size, the number of 
partitions written, their distribution in the ring and the 
'repair_session_space' value. If the table is large, the writes touch a large 
number of partitions scattered across the token ring, and the value of 
'repair_session_space' is small, you may end up with a very expensive 
over-streaming.

Thanks for the warning. In our case it worked well (obviously we tested it on a 
test cluster before applying it on the production clusters), but it is good to 
know that this might not always be the case.

Maybe I misunderstand how full and incremental repairs work in C* 4.x. I would 
appreciate if you could clarify this for me.

So far, I assumed that a full repair on a cluster that is also using 
incremental repair pretty much works like on a cluster that is not using 
incremental repair at all, the only difference being that the set of repaired 
und unrepaired data is repaired separately, so the Merkle trees that are 
calculated for repaired and unrepaired data are completely separate.

I also assumed that incremental repair only looks at unrepaired data, which is 
why it is so fast.

Is either of these two assumptions wrong?

If not, I do not quite understand how a lot of overstreaming might happen, as 
long as (I forgot to mention this step in my original e-mail) I run an 
incremental repair directly after restarting the nodes and marking all data as 
repaired.

I understand that significant overstreaming might happen during this first 
repair (in the worst case streaming all the unrepaired data that a node 
stores), but due to the short amount of time between starting to mark data as 
repaired and running the incremental repair, the whole set of unrepaired data 
should be rather small, so this overstreaming should not cause any issues.

 From this point on, the unrepaired data on the different nodes should be in 
sync and discrepancies in the repaired data during the full repair should not 
be bigger than they had been if I had run a full repair without marking any data 
as repaired.

I would really appreciate if you could point out the hole in this reasoning. 
Maybe I have a fundamentally wrong understanding of the repair process, and if 
I do I would like to correct this.



Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Caution, using the method you described, the amount of data streamed at 
the end with the full repair is not the amount of data written between 
stopping the first node and the last node, but depends on the table 
size, the number of partitions written, their distribution in the ring 
and the 'repair_session_space' value. If the table is large, the writes 
touch a large number of partitions scattered across the token ring, and 
the value of 'repair_session_space' is small, you may end up with a very 
expensive over-streaming.


On 07/02/2024 12:33, Sebastian Marsching wrote:
Full repair running for an entire week sounds excessively long. Even 
if you've got 1 TB of data per node, 1 week means the repair speed is 
less than 2 MB/s, that's very slow. Perhaps you should focus on 
finding the bottleneck of the full repair speed and work on that instead.


We store about 3–3.5 TB per node on spinning disks (time-series data), 
so I don’t think it is too surprising.


Not disabling auto-compaction may result in repaired SSTables getting 
compacted together with unrepaired SSTables before the repair state 
is set on them, which leads to mismatch in the repaired data between 
nodes, and potentially very expensive over-streaming in a future full 
repair. You should follow the documented and tested steps and not 
improvise or get creative if you value your data and time.


There is a different method that we successfully used on three 
clusters, but I agree that anti-entropy repair is a tricky business 
and one should be cautious with trying less tested methods.


Due to the long time for a full repair (see my earlier explanation), 
disabling anticompaction while running the full repair wasn’t an 
option for us. It was previously suggested that one could run the 
repair per node instead of the full cluster, but I don’t think that 
this will work, because only marking the SSTables on a single node as 
repaired would lead to massive overstreaming when running the full 
repair for the next node that shares data with the first one.


So, I want to describe the method that we used, just in case someone 
is in the same situation:


Going around the ring, we temporarily stopped each node and marked all 
of its SSTables as repaired. Then we immediately ran a full repair, so 
that any inconsistencies in the data that was now marked as repaired 
but not actually repaired were fixed.


Using this approach, the amount of over-streaming is minimal (at 
least for not too large clusters, where the rolling restart can be 
done in less than an hour or so), because the only difference between 
the “unrepaired” SSTables on the different nodes will be the data that 
was written before stopping the first node and stopping the last node.


Any inconsistencies that might exist in the SSTables that were marked 
as repaired should be caught in the full repair, so I do not think it 
is too dangerous either. However, I agree that for clusters where a 
full repair is quick (e.g. finishes in a few hours), using the 
well-tested and frequently used approach is probably better.


Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Just one more thing. Make sure you run 'nodetool repair -full' instead 
of just 'nodetool repair'. That's because the command's default was 
changed in Cassandra 2.x. The default was full repair before that 
change; it is now incremental repair.
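
For example (the keyspace name is just a placeholder):

    # explicitly request a full repair; the bare command defaults to incremental
    nodetool repair -full my_keyspace
    # incremental repair (the default)
    nodetool repair my_keyspace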


On 07/02/2024 10:28, Bowen Song via user wrote:


Not disabling auto-compaction may result in repaired SSTables getting 
compacted together with unrepaired SSTables before the repair state is 
set on them, which leads to mismatch in the repaired data between 
nodes, and potentially very expensive over-streaming in a future full 
repair. You should follow the documented and tested steps and not 
improvise or get creative if you value your data and time.


On 06/02/2024 23:55, Kristijonas Zalys wrote:


Hi folks,


Thank you all for your insight, this has been very helpful.


I was going through the migration process here 
<https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesMigration.html>and 
I’m not entirely sure why disabling autocompaction on the node is 
required? Could anyone clarify what would be the side effects of not 
disabling autocompaction and starting with step 2 of the migration?



Thanks,

Kristijonas



On Sun, Feb 4, 2024 at 12:18 AM Alexander DEJANOVSKI 
 wrote:


Hi Sebastian,

That's a feature we need to implement in Reaper. I think
disallowing the start of the new incremental repair would be
easier to manage than pausing the full repair that's already
running. It's also what I think I'd expect as a user.

I'll create an issue to track this.

Le sam. 3 févr. 2024, 16:19, Sebastian Marsching
 a écrit :

Hi,


2. use an orchestration tool, such as Cassandra Reaper, to
take care of that for you. You will still need monitor and
alert to ensure the repairs are run successfully, but fixing
a stuck or failed repair is not very time sensitive, you can
usually leave it till Monday morning if it happens at Friday
night.


Does anyone know how such a schedule can be created in
Cassandra Reaper?

I recently learned the hard way that running both a full and
an incremental repair for the same keyspace and table in
parallel is not a good idea (it caused a very unpleasant
overload situation on one of our clusters).

At the moment, we have one schedule for the full repairs
(every 90 days) and another schedule for the incremental
repairs (daily). But as full repairs take much longer than a
day (about a week, in our case), the two schedules collide.
So, Cassandra Reaper starts an incremental repair while the
full repair is still in process.

Does anyone know how to avoid this? Optimally, the full
repair would be paused (no new segments started) for the
duration of the incremental repair. The second best option
would be inhibiting the incremental repair while a full
repair is in progress.

Best regards,
Sebastian


Re: Switching to Incremental Repair

2024-02-07 Thread Bowen Song via user
Not disabling auto-compaction may result in repaired SSTables getting 
compacted together with unrepaired SSTables before the repair state is 
set on them, which leads to a mismatch in the repaired data between nodes, 
and potentially very expensive over-streaming in a future full repair. 
You should follow the documented and tested steps and not improvise or 
get creative if you value your data and time.


On 06/02/2024 23:55, Kristijonas Zalys wrote:


Hi folks,


Thank you all for your insight, this has been very helpful.


I was going through the migration process here 
and 
I’m not entirely sure why disabling autocompaction on the node is 
required? Could anyone clarify what would be the side effects of not 
disabling autocompaction and starting with step 2 of the migration?



Thanks,

Kristijonas



On Sun, Feb 4, 2024 at 12:18 AM Alexander DEJANOVSKI 
 wrote:


Hi Sebastian,

That's a feature we need to implement in Reaper. I think
disallowing the start of the new incremental repair would be
easier to manage than pausing the full repair that's already
running. It's also what I think I'd expect as a user.

I'll create an issue to track this.

Le sam. 3 févr. 2024, 16:19, Sebastian Marsching
 a écrit :

Hi,


2. use an orchestration tool, such as Cassandra Reaper, to
take care of that for you. You will still need monitor and
alert to ensure the repairs are run successfully, but fixing
a stuck or failed repair is not very time sensitive, you can
usually leave it till Monday morning if it happens at Friday
night.


Does anyone know how such a schedule can be created in
Cassandra Reaper?

I recently learned the hard way that running both a full and
an incremental repair for the same keyspace and table in
parallel is not a good idea (it caused a very unpleasant
overload situation on one of our clusters).

At the moment, we have one schedule for the full repairs
(every 90 days) and another schedule for the incremental
repairs (daily). But as full repairs take much longer than a
day (about a week, in our case), the two schedules collide.
So, Cassandra Reaper starts an incremental repair while the
full repair is still in process.

Does anyone know how to avoid this? Optimally, the full repair
would be paused (no new segments started) for the duration of
the incremental repair. The second best option would be
inhibiting the incremental repair while a full repair is in
progress.

Best regards,
Sebastian


Re: Switching to Incremental Repair

2024-02-03 Thread Bowen Song via user
Full repair running for an entire week sounds excessively long. Even if 
you've got 1 TB of data per node, 1 week means the repair speed is less 
than 2 MB/s, that's very slow. Perhaps you should focus on finding the 
bottleneck of the full repair speed and work on that instead.
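
As a back-of-the-envelope check of that number (assuming 1 TB per node 
and 7 days of wall time):

    # approximate repair throughput in MB/s
    python3 -c 'print(1e12 / (7 * 24 * 3600) / 1e6)'   # prints ~1.65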



On 03/02/2024 16:18, Sebastian Marsching wrote:

Hi,


2. use an orchestration tool, such as Cassandra Reaper, to take care 
of that for you. You will still need monitor and alert to ensure the 
repairs are run successfully, but fixing a stuck or failed repair is 
not very time sensitive, you can usually leave it till Monday morning 
if it happens at Friday night.



Does anyone know how such a schedule can be created in Cassandra Reaper?

I recently learned the hard way that running both a full and an 
incremental repair for the same keyspace and table in parallel is not 
a good idea (it caused a very unpleasant overload situation on one of 
our clusters).


At the moment, we have one schedule for the full repairs (every 90 
days) and another schedule for the incremental repairs (daily). But as 
full repairs take much longer than a day (about a week, in our case), 
the two schedules collide. So, Cassandra Reaper starts an incremental 
repair while the full repair is still in process.


Does anyone know how to avoid this? Optimally, the full repair would 
be paused (no new segments started) for the duration of the 
incremental repair. The second best option would be inhibiting the 
incremental repair while a full repair is in progress.


Best regards,
Sebastian



Re: Switching to Incremental Repair

2024-02-03 Thread Bowen Song via user

Hi Kristijonas,

It is not possible to run two repairs, regardless of whether they are 
incremental or full, for the same token range and on the same table 
concurrently. You have two options:


1. create schedules that don't overlap, e.g. run incremental repair 
daily except the 1st of each month, and run full repair on the 1st of 
each month (see the crontab sketch further below). If you choose to do this, 
make sure you set up a monitoring and alerting system for it and have someone 
respond to the alerts on weekends or public holidays. If a repair takes longer 
than usual and is at risk of overlapping with the next repair, a timely human 
intervention is required to prevent that - either kill the currently running 
repair or skip the next one.


2. use an orchestration tool, such as Cassandra Reaper, to take care of 
that for you. You will still need monitoring and alerting to ensure the 
repairs run successfully, but fixing a stuck or failed repair is not 
very time sensitive; you can usually leave it till Monday morning if it 
happens on Friday night.


Personally I would recommend the 2nd option, because getting back to 
your laptop at 10 pm on Friday night after you have had a few beers is 
not fun.
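
For anyone going with option 1, here is a minimal crontab sketch. The 
times, the keyspace name and the assumption that the repairs are triggered 
directly from cron on a node are all placeholders to adapt, not a 
recommendation:

    # incremental repair daily at 01:00, except on the 1st of the month
    0 1 2-31 * *  nodetool repair my_keyspace
    # full repair on the 1st of each month
    0 1 1 * *     nodetool repair -full my_keyspace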


Cheers,
Bowen

On 03/02/2024 01:59, Kristijonas Zalys wrote:

Hi Bowen,

Thank you for your help!

So given that we would need to run both incremental and full repair 
for a given cluster, is it safe to have both types of repair running 
for the same token ranges at the same time? Would it not create a race 
condition?


Thanks,
Kristijonas

On Fri, Feb 2, 2024 at 3:36 PM Bowen Song via user 
 wrote:


Hi Kristijonas,

To answer your questions:

1. It's still necessary to run full repair on a cluster on which
incremental repair is run periodically. The frequency of full
repair is more of an art than science. Generally speaking, the
less reliable the storage media, the more frequently full repair
should be run. The documentation on this topic is available here

<https://cassandra.apache.org/doc/stable/cassandra/operating/repair.html#incremental-and-full-repairs>

2. Run incremental repair for the first time on an existing
cluster does cause Cassandra to re-compact all SSTables, and can
lead to disk usage spikes. This can be avoided by following the
steps mentioned here

<https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesMigration.html>


I hope that helps.

Cheers,
Bowen

On 02/02/2024 20:57, Kristijonas Zalys wrote:


Hi folks,


I am working on switching from full to incremental repair in
Cassandra v4.0.6 (soon to be v4.1.3) and I have a few questions.


1.

Is it necessary to run regular full repair on a cluster if I
already run incremental repair? If yes, what frequency would
you recommend for full repair?

2.

Has anyone experienced disk usage spikes while using
incremental repair? I have noticed temporary disk footprint
increases of up to 2x (from ~15 GiB to ~30 GiB) caused by
anti-compaction while testing and am wondering how likely
that is to happen in bigger real world use cases?


Thank you all in advance!

Kristijonas



Re: Switching to Incremental Repair

2024-02-02 Thread Bowen Song via user

Hi Kristijonas,

To answer your questions:

1. It's still necessary to run full repair on a cluster on which 
incremental repair is run periodically. The frequency of full repair is 
more of an art than science. Generally speaking, the less reliable the 
storage media, the more frequently full repair should be run. The 
documentation on this topic is available here 



2. Running incremental repair for the first time on an existing cluster does 
cause Cassandra to re-compact all SSTables, and can lead to disk usage 
spikes. This can be avoided by following the steps mentioned here 
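
Roughly, the per-node procedure from that document looks like the sketch 
below. The keyspace/table names and paths are placeholders, and the linked 
page remains the authoritative reference:

    # 1. stop repaired and unrepaired sstables from being compacted together
    nodetool disableautocompaction my_keyspace my_table
    # 2. run the usual full repair
    nodetool repair -full my_keyspace my_table
    # 3. stop the node, then mark its sstables as repaired
    find /var/lib/cassandra/data/my_keyspace/my_table-*/ -iname "*Data.db*" > sstables.txt
    sstablerepairedset --really-set --is-repaired -f sstables.txt
    # 4. start the node again and make sure autocompaction is enabled,
    #    then move on to the next node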
 



I hope that helps.

Cheers,
Bowen

On 02/02/2024 20:57, Kristijonas Zalys wrote:


Hi folks,


I am working on switching from full to incremental repair in Cassandra 
v4.0.6 (soon to be v4.1.3) and I have a few questions.



1.

Is it necessary to run regular full repair on a cluster if I
already run incremental repair? If yes, what frequency would you
recommend for full repair?

2.

Has anyone experienced disk usage spikes while using incremental
repair? I have noticed temporary disk footprint increases of up to
2x (from ~15 GiB to ~30 GiB) caused by anti-compaction while
testing and am wondering how likely that is to happen in bigger
real world use cases?


Thank you all in advance!

Kristijonas



Re: Tests failing for ppc64le architecture.

2024-01-30 Thread Bowen Song via user

Hi Sunidhi,

In case you haven't noticed, this is the Cassandra user mailing list, 
not the dev mailing list. Most people in this mailing list have never 
attempted to built Cassandra from the source code. IMHO you should try 
the Cassandra dev mailing list for this type of things.


Cheers,
Bowen


On 30/01/2024 13:00, Sunidhi Gaonkar via user wrote:


Hi team, any thoughts on this?

Thank you and Regards,

Sunidhi Gaonkar.



*From:* Sunidhi Gaonkar
*Sent:* Thursday, January 11, 2024 7:19 PM
*To:* user@cassandra.apache.org 
*Subject:* Tests failing for ppc64le architecture.

Hi Team,

I am working on validating Cassandra on the ppc64le architecture. I have 
followed these steps to build Cassandra from source:


1. Install java-17, python3.7, ant, cmake,ninja.

2. Build netty-tcnative and transport-native-epoll from source since 
jars are not available for ppc64le.


3. Clone Cassandra repository, checked out to cassandra-5.0-beta1.

4. Command used to build: ant

5. Command used to test: ant test

5 tests mentioned below are failing:

FullQueryLoggerTest

SSTableReaderTest

FileTest

SystemPropertiesBasedFileSystemOwnershipCheckTest

YamlBasedFileSystemOwnershipCheckTest

I have observed the same tests failing on the x86 architecture. Please find 
attached the logs for the failing tests below.


Additional details:

OS: Red Hat Enterprise Linux 8.6

Any suggestions and pointers regarding the same will be helpful.

Thank you and Regards,

Sunidhi Gaonkar.


Re: Over streaming in one node during repair.

2024-01-24 Thread Bowen Song via user

Some common causes of over-streaming:

 * "repair_session_space" is too small (either manually specified, or
   heap size is small and data on disk is large)
 * Manually deleting SSTable files
 * Unexpected foreign (e.g. from a backup) SSTable files
 * Marking SSTable as repaired or unrepaired inconsistently across nodes
 * Disk/filesystem corruption
 * A node that has been down for a very long time comes back

That's all I can think of, other people may have more to add.

To troubleshoot this, you may find the "nodetool getsstables", 
"sstablemetadata" and "sstabledump" commands handy.



On 23/01/2024 18:07, manish khandelwal wrote:
In one of our two datacenter setup(3+3), one Cassndra node is getting 
lot of data streamed from other nodes during repair to the extent that 
it fills up and ends with full disk. I am not able to understand what 
could be the reason that this node is misbehaving in the cluster. 
Cassandra version is 3.11.2


System logs show every node sending data to this node. Any pointer on 
where I should look would be helpful.


Regards
Manish

Re: COMMERCIAL:Re: COMMERCIAL:Re: COMMERCIAL:Re: system_schema.tables id and table uuid on disk mismatch

2024-01-18 Thread Bowen Song via user
Without knowing the cause of the issue, it's hard to tell what the 
correct steps to recover from it are. I would recommend you have a look at 
the logs and figure out what caused the issue, and then make a 
recovery plan and also put preventive measures in place to stop it from 
happening again.



Also, I'm not comfortable with the idea of manually creating data 
directories for Cassandra unless I know the system well, as this may 
lead to filesystem ownership and permission issues. The involvement of 
security tools like AppArmor or SELinux may make it even more complicated.



On 18/01/2024 15:46, ENES ATABERK wrote:


ok thank you!

What do you think about the following approach:

 1.  creating empty correct table id directories in linux filesystem
with respect to the system_schema.tables id column
 2. importing data with nodetool import from incorrect directory
 3. removing the incorrect directory afterwards


*From:* Bowen Song via user 
*Sent:* Thursday, January 18, 2024 5:34:57 PM
*To:* user@cassandra.apache.org
*Cc:* Bowen Song
*Subject:* COMMERCIAL:Re: COMMERCIAL:Re: COMMERCIAL:Re: 
system_schema.tables id and table uuid on disk mismatch


I know dropping a table and then creating a new table with the same 
name can lead to that result, which is expected. If that wasn't what 
happened, it may be a bug in Cassandra. If you can reproduce the 
behaviour, you should raise a Jira ticket for it.



On 18/01/2024 14:44, ENES ATABERK wrote:


It has same mismatch id in all nodes not just one node.



*From:* Bowen Song via user 
*Sent:* Thursday, January 18, 2024 3:18:11 PM
*To:* user@cassandra.apache.org
*Cc:* Bowen Song
*Subject:* COMMERCIAL:Re: COMMERCIAL:Re: system_schema.tables id and 
table uuid on disk mismatch


Was the table ID mismatching only on one node or all nodes? 
Mismatching on one node is usually the result of a racing condition, 
but on all nodes isn't. The solution I mentioned earlier only applies 
to the one node situation.



On 18/01/2024 13:14, ENES ATABERK wrote:


Hi all,

Thanks for your responses.

The version is Cassandra 4.1.3

After I restarted all the nodes one-by-one cassandra created 
corrected-id folder and keep the incorrect one as you said.


But then I cannot see the data from cqlsh it gives me no result. 
After i have imported the data from incorrect-id-folder i see the data



nodetool import keyspace_name table_name 
/full_path_of_old(incorrect)_folder.



now my questions are like:

First question; before i have restarted the nodes how can i search 
the data although there is a mismatch between system_schema.tables 
and actual directories in all nodes.



Second one; is nodetool import a safe way to load data from 
incorrect folder on a write heavy system. Because I cannot be sure 
if I miss any data during the import operation. Or do i need to run 
a repair for those tables instead of import? In my opinion that 
may not work because i cannot see any data before nodetool import.



Thanks again.



*From:* Bowen Song via user 
*Sent:* Thursday, January 18, 2024 1:17:11 PM
*To:* user@cassandra.apache.org
*Cc:* Bowen Song
*Subject:* COMMERCIAL:Re: system_schema.tables id and table uuid on 
disk mismatch


It sounds like you have done some concurrent table creation/deletion 
in the past (e.g. CREATE TABLE IF NOT EXISTS from multiple clients), 
which resulted in this mismatch. After you restarted the node, 
Cassandra corrected it by discarding the old table ID and any data 
associated with it. This is the expected behaviour. This issue has 
already been fixed, and you can safely delete the data directory 
with the incorrect table ID as it is no longer used by Cassandra. 
You should now run a full repair on this node to ensure it has all 
the data it owns. If you are /absolutely/ certain that the table 
with different IDs have identical schema, and the gc_grace_seconds 
hasn't past, you may move the data from the wrong data directory to 
the correct data directory, and then restart the node or run 
"nodetool refresh  " on the node before running the 
full repair, this may save you some streaming time. However, if the 
table schema is different, this may cause a havoc.



On 18/01/2024 05:21, ENES ATABERK wrote:


Hi all,

we have detected that table-uuid in linux file directory is 
different from system_schema.tables id.


I have executed nodetool describe cluster and see only one schema 
version in the cluster.


How we can fix this issue do anyone has any idea? Restarting the 
nodes only create a new empty directory with 
name system_schema.tables id directory but in this case i have two 
directories old one has sstables with incorrect uuid new one has 
correct uuid but empty.


thanks in advance





Re: COMMERCIAL:Re: COMMERCIAL:Re: system_schema.tables id and table uuid on disk mismatch

2024-01-18 Thread Bowen Song via user
I know dropping a table and then creating a new table with the same name 
can lead to that result, which is expected. If that wasn't what 
happened, it may be a bug in Cassandra. If you can reproduce the 
behaviour, you should raise a Jira ticket for it.



On 18/01/2024 14:44, ENES ATABERK wrote:


It has same mismatch id in all nodes not just one node.



*From:* Bowen Song via user 
*Sent:* Thursday, January 18, 2024 3:18:11 PM
*To:* user@cassandra.apache.org
*Cc:* Bowen Song
*Subject:* COMMERCIAL:Re: COMMERCIAL:Re: system_schema.tables id and 
table uuid on disk mismatch


Was the table ID mismatching only on one node or all nodes? 
Mismatching on one node is usually the result of a racing condition, 
but on all nodes isn't. The solution I mentioned earlier only applies 
to the one node situation.



On 18/01/2024 13:14, ENES ATABERK wrote:


Hi all,

Thanks for your responses.

The version is Cassandra 4.1.3

After I restarted all the nodes one-by-one cassandra created 
corrected-id folder and keep the incorrect one as you said.


But then I cannot see the data from cqlsh it gives me no result. 
After i have imported the data from incorrect-id-folder i see the data



nodetool import keyspace_name table_name 
/full_path_of_old(incorrect)_folder.



now my questions are like:

First question; before i have restarted the nodes how can i search 
the data although there is a mismatch between system_schema.tables 
and actual directories in all nodes.



Second one; is nodetool import a safe way to load data from incorrect 
folder on a write heavy system. Because I cannot be sure if I miss 
any data during the import operation. Or do i need to run a repair 
for those tables instead of import? In my opinion that may not work 
because i cannot see any data before nodetool import.



Thanks again.



*From:* Bowen Song via user 
*Sent:* Thursday, January 18, 2024 1:17:11 PM
*To:* user@cassandra.apache.org
*Cc:* Bowen Song
*Subject:* COMMERCIAL:Re: system_schema.tables id and table uuid on 
disk mismatch


It sounds like you have done some concurrent table creation/deletion 
in the past (e.g. CREATE TABLE IF NOT EXISTS from multiple clients), 
which resulted in this mismatch. After you restarted the node, 
Cassandra corrected it by discarding the old table ID and any data 
associated with it. This is the expected behaviour. This issue has 
already been fixed, and you can safely delete the data directory with 
the incorrect table ID as it is no longer used by Cassandra. You 
should now run a full repair on this node to ensure it has all the 
data it owns. If you are /absolutely/ certain that the table with 
different IDs have identical schema, and the gc_grace_seconds hasn't 
past, you may move the data from the wrong data directory to the 
correct data directory, and then restart the node or run "nodetool 
refresh  " on the node before running the full 
repair, this may save you some streaming time. However, if the table 
schema is different, this may cause a havoc.



On 18/01/2024 05:21, ENES ATABERK wrote:


Hi all,

we have detected that table-uuid in linux file directory is 
different from system_schema.tables id.


I have executed nodetool describe cluster and see only one schema 
version in the cluster.


How we can fix this issue do anyone has any idea? Restarting the 
nodes only create a new empty directory with 
name system_schema.tables id directory but in this case i have two 
directories old one has sstables with incorrect uuid new one has 
correct uuid but empty.


thanks in advance





Re: COMMERCIAL:Re: system_schema.tables id and table uuid on disk mismatch

2024-01-18 Thread Bowen Song via user
Was the table ID mismatching only on one node or all nodes? Mismatching 
on one node is usually the result of a race condition, but on all 
nodes isn't. The solution I mentioned earlier only applies to the one 
node situation.



On 18/01/2024 13:14, ENES ATABERK wrote:


Hi all,

Thanks for your responses.

The version is Cassandra 4.1.3

After I restarted all the nodes one-by-one cassandra created 
corrected-id folder and keep the incorrect one as you said.


But then I cannot see the data from cqlsh it gives me no result. After 
i have imported the data from incorrect-id-folder i see the data



nodetool import keyspace_name table_name 
/full_path_of_old(incorrect)_folder.



now my questions are like:

First question; before i have restarted the nodes how can i search the 
data although there is a mismatch between system_schema.tables and 
actual directories in all nodes.



Second one; is nodetool import a safe way to load data from incorrect 
folder on a write heavy system. Because I cannot be sure if I miss any 
data during the import operation. Or do i need to run a repair for 
those tables instead of import? In my opinion that may not work 
because i cannot see any data before nodetool import.



Thanks again.



*From:* Bowen Song via user 
*Sent:* Thursday, January 18, 2024 1:17:11 PM
*To:* user@cassandra.apache.org
*Cc:* Bowen Song
*Subject:* COMMERCIAL:Re: system_schema.tables id and table uuid on 
disk mismatch


It sounds like you have done some concurrent table creation/deletion 
in the past (e.g. CREATE TABLE IF NOT EXISTS from multiple clients), 
which resulted in this mismatch. After you restarted the node, 
Cassandra corrected it by discarding the old table ID and any data 
associated with it. This is the expected behaviour. This issue has 
already been fixed, and you can safely delete the data directory with 
the incorrect table ID as it is no longer used by Cassandra. You 
should now run a full repair on this node to ensure it has all the 
data it owns. If you are /absolutely/ certain that the table with 
different IDs have identical schema, and the gc_grace_seconds hasn't 
past, you may move the data from the wrong data directory to the 
correct data directory, and then restart the node or run "nodetool 
refresh  " on the node before running the full 
repair, this may save you some streaming time. However, if the table 
schema is different, this may cause a havoc.



On 18/01/2024 05:21, ENES ATABERK wrote:


Hi all,

we have detected that table-uuid in linux file directory is different 
from system_schema.tables id.


I have executed nodetool describe cluster and see only one schema 
version in the cluster.


How we can fix this issue do anyone has any idea? Restarting the 
nodes only create a new empty directory with 
name system_schema.tables id directory but in this case i have two 
directories old one has sstables with incorrect uuid new one has 
correct uuid but empty.


thanks in advance







Re: system_schema.tables id and table uuid on disk mismatch

2024-01-18 Thread Bowen Song via user
It sounds like you have done some concurrent table creation/deletion in 
the past (e.g. CREATE TABLE IF NOT EXISTS from multiple clients), which 
resulted in this mismatch. After you restarted the node, Cassandra 
corrected it by discarding the old table ID and any data associated with 
it. This is the expected behaviour. This issue has already been fixed, 
and you can safely delete the data directory with the incorrect table ID 
as it is no longer used by Cassandra. You should now run a full repair 
on this node to ensure it has all the data it owns. If you are 
/absolutely/ certain that the tables with different IDs have identical 
schemas, and gc_grace_seconds hasn't passed, you may move the data from 
the wrong data directory to the correct data directory, and then restart 
the node or run "nodetool refresh <keyspace> <table>" on the node before 
running the full repair; this may save you some streaming time. However, 
if the table schemas are different, this may cause havoc.
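
For example, a quick way to compare the schema's table id with the 
directories on disk (keyspace, table and data path are placeholders):

    # table id according to the schema
    cqlsh -e "SELECT table_name, id FROM system_schema.tables WHERE keyspace_name = 'my_keyspace';"
    # table id according to the directory names on disk
    ls -d /var/lib/cassandra/data/my_keyspace/my_table-*
    # after moving the sstables into the directory with the correct id:
    nodetool refresh my_keyspace my_table
    nodetool repair -full my_keyspace my_table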



On 18/01/2024 05:21, ENES ATABERK wrote:


Hi all,

we have detected that table-uuid in linux file directory is different 
from system_schema.tables id.


I have executed nodetool describe cluster and see only one schema 
version in the cluster.


How can we fix this issue, does anyone have any idea? Restarting the nodes 
only creates a new empty directory named after the system_schema.tables id, 
but in this case I have two directories: the old one has sstables with the 
incorrect uuid, and the new one has the correct uuid but is empty.


thanks in advance








Re: About Map column

2023-12-18 Thread Bowen Song via user

Hi Sebastien,

It's a bit more complicated than that.

To begin with, the first-class citizen in Cassandra is the partition, not 
the row. All map fields in the same row are in the same partition, and all 
rows with the same partition key but different clustering keys are also 
in the same partition. During a compaction, Cassandra does its best not 
to split a partition into multiple SSTables, unless it must, e.g. when 
dealing with repaired vs unrepaired data. That means regardless of whether 
it's a map field in a row or multiple rows within the same partition, they 
get compacted into the same number of SSTables.


A map type field's data may live in one column, but definitely not just 
one blob of data from the server's perspective, unless it's frozen. 
Reading such data is no cheaper than reading multiple columns and rows 
within the same partition, as each component of it, a key or a value, 
needs to be deserialised individually from the on-disk SSTable format, 
and then serialised again for the network protocol (often called the 
native protocol, NTP, or binary protocol) when it is read by a CQL client.


There's no obvious performance benefit for reading key-value pairs from 
a map field in a row vs columns and rows in the same partition. However, 
each row can be read separately and selectively, but key-value pairs in 
a map cannot. All data in a map field must be fetched all at once. So if 
you ever need to selectively read the data, reading multiple columns and 
rows in the same partition filtered by clustering keys will actually 
perform better than reading all key-value pairs from a large map type 
field and then discarding the unwanted data.
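
To illustrate the difference, a minimal sketch with made-up keyspace and 
table names:

    cqlsh -e "
      CREATE TABLE my_keyspace.data_by_map (
          id   text PRIMARY KEY,
          vals map<text, text>      -- the whole map is always read at once
      );
      CREATE TABLE my_keyspace.data_by_rows (
          id text,
          k  text,
          v  text,
          PRIMARY KEY ((id), k)     -- rows can be read selectively
      );
      -- only the second model allows fetching a single key-value pair:
      SELECT v FROM my_keyspace.data_by_rows WHERE id = 'x' AND k = 'k1';"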


If you really want better server-side read performance and always read 
the whole thing, you should consider using a frozen map or frozen UDT 
instead. Of course, there's a cost to freezing them. Frozen data cannot 
be partially modified (e.g. adding, removing or updating a value in it); it 
can only be deleted or overwritten with new data at once, which means it may 
not be suitable for your use case.


I can see you also mentioned big partitions. Large partitions in 
Cassandra are usually a bad idea, regardless of whether it's a single row 
with a few columns or many rows with many columns. There are some exceptions that may 
work well, but generally you should avoid creating large partitions if 
possible. The problem with large partitions is usually the JVM heap and 
GC pauses, rarely CPU or disk resources.


Regards,
Bowen


On 18/12/2023 17:00, Sébastien Rebecchi wrote:

Hello

If I have a column of type Map, then with many insertions the map 
grows, but after compaction, as the full map is 1 column of a table, 
will it be contained fully in 1 SSTable?

I guess yes, because the map is contained in a single row. Am I right?
Versus if we use a clustering key + a standard column instead of a 
map, insertions will create many rows, 1 per clustering key value, so 
even after compaction the partition could be split across several SSTables.
Can you tell me if I understood correctly please? Because if it is 
right, then it means the problem of big partitions can be made worse by using 
a Map, as it will take much more CPU and disk resources to perform 
compaction (on the other hand you will have a lower read amplification 
factor with a map).


Thanks,

Sébastien


Re: Schema inconsistency in mixed-version cluster

2023-12-12 Thread Bowen Song via user

I don't recognise those names:

 * channel_data_id
 * control_system_type
 * server_id
 * decimation_levels

I assume these are column names of a non-system table.

From the stack trace, this looks like an error from a node which was 
running 4.1.3, and this node was not the coordinator for this query.


I did some research and found these bug reports which may be related:

 * CASSANDRA-15899: Dropping a column can break queries until the schema
   is fully propagated
 * CASSANDRA-16735: Adding columns via ALTER TABLE can generate corrupt
   sstables

The solution for CASSANDRA-16735 was to revert CASSANDRA-15899, 
according to the comments in the ticket.


This does look like CASSANDRA-15899 is back, but I can't see why it was 
only happening when the nodes were running mixed versions, and then 
stopped after all nodes were upgraded.



On 12/12/2023 16:28, Sebastian Marsching wrote:

Hi,

while upgrading our production cluster from C* 3.11.14 to 4.1.3, we experienced 
the issue that some SELECT queries failed due to supposedly no replica being 
available. The system logs on the C* nodes where full of messages like the 
following one:

ERROR [ReadStage-1] 2023-12-11 13:53:57,278 JVMStabilityInspector.java:68 - 
Exception in thread Thread[ReadStage-1,5,SharedPool]
java.lang.IllegalStateException: [channel_data_id, control_system_type, 
server_id, decimation_levels] is not a subset of [channel_data_id]
 at 
org.apache.cassandra.db.Columns$Serializer.encodeBitmap(Columns.java:593)
 at 
org.apache.cassandra.db.Columns$Serializer.serializeSubset(Columns.java:523)
 at 
org.apache.cassandra.db.rows.UnfilteredSerializer.serializeRowBody(UnfilteredSerializer.java:231)
 at 
org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:205)
 at 
org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:137)
 at 
org.apache.cassandra.db.rows.UnfilteredSerializer.serialize(UnfilteredSerializer.java:125)
 at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:140)
 at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:95)
 at 
org.apache.cassandra.db.rows.UnfilteredRowIteratorSerializer.serialize(UnfilteredRowIteratorSerializer.java:80)
 at 
org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:308)
 at 
org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:201)
 at 
org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:186)
 at 
org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:182)
 at 
org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
 at 
org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:337)
 at 
org.apache.cassandra.db.ReadCommandVerbHandler.doVerb(ReadCommandVerbHandler.java:63)
 at 
org.apache.cassandra.net.InboundSink.lambda$new$0(InboundSink.java:78)
 at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:97)
 at org.apache.cassandra.net.InboundSink.accept(InboundSink.java:45)
 at 
org.apache.cassandra.net.InboundMessageHandler$ProcessMessage.run(InboundMessageHandler.java:430)
 at 
org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
 at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:142)
 at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
 at java.base/java.lang.Thread.run(Thread.java:829)

This problem only persisted while the cluster had a mix of 3.11.14 and 4.1.3 
nodes. As soon as the last node was updated, the problem disappeared 
immediately, so I suspect that it was somehow caused by the unavoidable schema 
inconsistency during the upgrade.

I just wanted to give everyone who hasn’t upgraded yet a heads up, so that they 
are aware that this problem might exist. Interestingly, it seems like not all 
queries involving the affected table were affected by this problem. As far as I 
am aware, no schema changes have ever been made to the affected table, so I am 
pretty certain that the schema inconsistencies were purely related to the 
upgrade process.

We hadn’t noticed this problem when testing the upgrade on our test cluster 
because there we first did the upgrade and then ran the test workload. So, if 
you are worried you might be affected by this problem as well, you might want 
to run your workload on the test cluster while having mixed versions.

I did not investigate the cause further because simply completing the upgrade 
process seemed like the quickest option to 

Re: Remove folders of deleted tables

2023-12-07 Thread Bowen Song via user
There's no requirement for the partition key to contain the date/time 
for a TWCS table. The important thing is that data needs to be written to the 
table in chronological order (i.e. do not use "USING TIMESTAMP" in 
the CQL queries) and that the same TTL is used for all partitions. TWCS was 
introduced many years ago to replace DTCS. Could it be that you have had 
some bad past experiences with DTCS?


I previously mentioned "date in the partition key" only because you 
currently are querying by table names, which would almost certainly 
contain a coarse date/time, and the equivalent in a TWCS table would 
need to have that coarse date/time from the table name in the partition 
key instead. If you don't need the ability to query by date/time, you 
can have partition key(s) without any date/time in them.


On 07/12/2023 09:08, Sébastien Rebecchi wrote:
Thanks Bowen, I also thought about using TTL and TWCS, but in my past 
experience with Cassandra I have had a lot of issues with data models 
using TTL and creating many tombstones. I was probably not using the 
right compaction at that time, but these experiences had a great impact 
on me and I would say they made me very cautious about TTL, 
even today, several years later ^^
Anyway, in the current 2 cases I can not put a date as partition key 
alone.
In the first one I have a pair of columns acting as partition key, one 
is a customer id and the other is a date (more precisely a second 
"block" of time to group all events of the same second in the same 
partition, to pre-group data). The customer id is mandatory in the 
partition key in order not to have too wide partitions, and I never 
have cases where I need to fetch cross-customer data. Is TWCS still 
suited? As I read from the doc, you don't need to have a time window as 
the partition key alone; as long as many partitions die at approximately the 
same time, it could work fine because sometimes entire SSTables will be 
deleted during compaction rather than rewritten to disk.
As for my second use case that creates many tables on demand, I can 
not have the time window in the PK, cause I think it would lead to 
degraded read performance, I have to put a visitor id in the PK and a 
timestamp as CK in order to fetch all event of a visitor inside a time 
window in 1 query. I could probably put a time window as PK (e.g. a 
second block like the 1st use case, and then perform a small number of 
queries to merge pre-results client-side) and in that case TTL+TWCS 
would probably apply, it remains the same question as above.

Thanks for your time :)

Sébastien.


Le mer. 6 déc. 2023 à 15:46, Bowen Song via user 
 a écrit :


There are many different ways to avoid or minimise the chance of
schema disagreements, the easiest way is to always send DDL
queries to the same node in the cluster. This is very easy to
implement and avoids schema disagreements at the cost of creating
a single point of failure for DDL queries. More sophisticated
methods also exist, such as locking and centralised schema
modification, and you should consider which one is more suitable
for your use case. Ignoring the schema disagreements problem is
not recommended, as this is not a tested state for the cluster,
you are likely to run into some known and unknown (and possibly
severe) issues later.

The system_schema.columns table will almost certainly have more
tombstones created than the number of tables deleted, unless each
deleted table had only one column. I doubt creating and deleting 8
tables per day will be a problem, but I would recommend you find a
way to test it before doing that on a production system, because I
don't know anyone else is using Cassandra in this way.

From the surface, it does sound like TWCS with the date in in the
partition key may fit your use case better than creating and
deleting tables every day.


On 06/12/2023 08:26, Sébastien Rebecchi wrote:

Hello Jeff, Bowen

Thanks for your answer.
Now I understand that there is a bug in Cassandra that can not
handle concurrent schema modifications, I was not aware of that
severity, I thought that temporary schema mismatches were
eventually resolved smartly, by a kind of "merge" mechanism.
For my use cases, keyspaces and tables are created "on-demand",
when receiving exceptions for invalid KS or table on insert (then
the KS and table are created and the insert is retried). I can
not afford to centralize schema modifications in a bottleneck,
but I can afford the data inconsistencies, waiting for the fix in
Cassandra.
I'm more worried about tombstones in system tables, I assume that
8 tombstones per day (or even more, but in the order of no more
than some dozens) is reasonable, can you confirm (or invalidate)
that please?

Sébastien.

    Le mer. 6 déc. 2023 à 03:00, Bowen Song via user
  

Re: Remove folders of deleted tables

2023-12-06 Thread Bowen Song via user
There are many different ways to avoid or minimise the chance of schema 
disagreements, the easiest way is to always send DDL queries to the same 
node in the cluster. This is very easy to implement and avoids schema 
disagreements at the cost of creating a single point of failure for DDL 
queries. More sophisticated methods also exist, such as locking and 
centralised schema modification, and you should consider which one is 
more suitable for your use case. Ignoring the schema disagreements 
problem is not recommended, as this is not a tested state for the 
cluster, you are likely to run into some known and unknown (and possibly 
severe) issues later.
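
As an illustration of the "same node for DDL" idea, here is a minimal sketch 
using the Python driver (the addresses, keyspace and table are made up; other 
drivers have similar mechanisms for pinning requests to one coordinator):

    from cassandra.cluster import Cluster, ExecutionProfile
    from cassandra.policies import WhiteListRoundRobinPolicy

    # A dedicated "ddl" execution profile whose load balancing policy only
    # ever selects the one designated node, so all schema changes are
    # serialised through it.
    ddl_profile = ExecutionProfile(
        load_balancing_policy=WhiteListRoundRobinPolicy(["10.0.0.1"]),
        request_timeout=30,
    )
    cluster = Cluster(
        contact_points=["10.0.0.1", "10.0.0.2", "10.0.0.3"],
        execution_profiles={"ddl": ddl_profile},
    )
    session = cluster.connect()

    # Reads and writes use the default profile (all nodes); DDL does not.
    session.execute(
        "CREATE TABLE IF NOT EXISTS my_ks.events_20231206 "
        "(id uuid PRIMARY KEY, payload text)",
        execution_profile="ddl",
    )

Note this only removes the concurrency at the coordinator; the designated node 
also becomes a single point of failure for DDL, as mentioned above.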


The system_schema.columns table will almost certainly have more 
tombstones created than the number of tables deleted, unless each 
deleted table had only one column. I doubt creating and deleting 8 
tables per day will be a problem, but I would recommend you find a way 
to test it before doing that on a production system, because I don't 
know anyone else is using Cassandra in this way.


From the surface, it does sound like TWCS with the date in the 
partition key may fit your use case better than creating and deleting 
tables every day.
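
For illustration, a minimal sketch of what such a table could look like 
(keyspace, columns, window size and TTL are all hypothetical; shown via the 
Python driver for consistency with the sketch above):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()
    session.execute("""
        CREATE TABLE IF NOT EXISTS my_ks.visitor_events (
            day        date,       -- time window as part of the partition key
            visitor_id text,
            ts         timestamp,
            payload    text,
            PRIMARY KEY ((day, visitor_id), ts)
        ) WITH compaction = {
            'class': 'TimeWindowCompactionStrategy',
            'compaction_window_unit': 'DAYS',
            'compaction_window_size': 1
          }
          AND default_time_to_live = 2592000  -- 30 days, to match the write pattern
    """)

A client then reads a visitor's events over a period by querying the few 
(day, visitor_id) partitions covering that period and merging the results 
client-side, as discussed in the thread.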



On 06/12/2023 08:26, Sébastien Rebecchi wrote:

Hello Jeff, Bowen

Thanks for your answer.
Now I understand that there is a bug in Cassandra and that it cannot handle 
concurrent schema modifications; I was not aware of that severity. I 
thought that temporary schema mismatches were eventually resolved 
smartly, by a kind of "merge" mechanism.
For my use cases, keyspaces and tables are created "on-demand", when 
receiving exceptions for an invalid KS or table on insert (then the KS 
and table are created and the insert is retried). I cannot afford to 
centralize schema modifications behind a bottleneck, but I can afford the 
data inconsistencies while waiting for the fix in Cassandra.
I'm more worried about the tombstones in the system tables. I assume that 8 
tombstones per day (or even more, but on the order of no more than 
a few dozen) is reasonable; can you confirm (or invalidate) that, please?


Sébastien.

On Wed, 6 Dec 2023 at 03:00, Bowen Song via user 
 wrote:


The same table name with two different CF IDs is not just
"temporary schema disagreements", it's much worse than that. This
breaks the eventual consistency guarantee, and leads to silent
data corruption. It's silently happening in the background, and
you don't realise it until you suddenly do, and then everything
seems to blow up at the same time. You need to sort this out ASAP.


On 05/12/2023 19:57, Sébastien Rebecchi wrote:

Hi Bowen,

Thanks for your answer.

I was thinking of extreme use cases, but as far as I am concerned
I can deal with the creation and deletion of 2 tables every 6 hours
for a keyspace. So that leaves around 8 folders of deleted tables per
day; sometimes more, because I sometimes see 2 folders created
for the same table name with 2 different ids, caused by temporary
schema disagreements I guess.
Basically it means 20 years before the KS folder has 65K
subfolders, so I would say I have time to think about redesigning
the data model ^^
Nevertheless, does that sound like too much in terms of tombstones in
the system tables (with the default GC grace period of 10 days)?

Sébastien.

On Tue, 5 Dec 2023, 12:19, Bowen Song via user
 wrote:

Please rethink your use case. Creating and deleting tables
concurrently often leads to schema disagreement. Even doing so
on a single node sequentially will lead to a large number of
tombstones in the system tables.

On 04/12/2023 19:55, Sébastien Rebecchi wrote:

Thank you Dipan.

Do you know if there is a good reason for Cassandra to leave
the table folders behind even when there is no snapshot?

I'm thinking of use cases where there is a need to create
and delete small tables at a high rate. You could quickly
end up with more than 65K (the ext4 limit) subdirectories in the
KS directory, while 99.9..% of them are residues of deleted
tables.

It looks quite dirty for Cassandra not to clean up its own
"garbage" by itself, and quite dangerous for the end user to
have to do it on their own, don't you think?

Thanks,

Sébastien.

On Mon, 4 Dec 2023, 11:28, Dipan Shah
 wrote:

Hello Sebastien,

There are no inbuilt tools that will automatically
remove folders of deleted tables.

Thanks,

Dipan Shah



*From:* Sébastien Rebecchi 
*Sent:* 04 December 2023 13:54
*To:* user@cassandra.apache.org 
*Subject:* Remove folders of deleted tables
Hello,

When w

Re: Remove folders of deleted tables

2023-12-05 Thread Bowen Song via user
The same table name with two different CF IDs is not just "temporary 
schema disagreements", it's much worse than that. This breaks the 
eventual consistency guarantee, and leads to silent data corruption. 
It's silently happening in the background, and you don't realise it 
until you suddenly do, and then everything seems to blow up at the same 
time. You need to sort this out ASAP.



On 05/12/2023 19:57, Sébastien Rebecchi wrote:

Hi Bowen,

Thanks for your answer.

I was thinking of extreme use cases, but as far as I am concerned I 
can deal with the creation and deletion of 2 tables every 6 hours for a 
keyspace. So that leaves around 8 folders of deleted tables per day; 
sometimes more, because I sometimes see 2 folders created for the same 
table name with 2 different ids, caused by temporary schema 
disagreements I guess.
Basically it means 20 years before the KS folder has 65K subfolders, 
so I would say I have time to think about redesigning the data model ^^
Nevertheless, does that sound like too much in terms of tombstones in the 
system tables (with the default GC grace period of 10 days)?


Sébastien.

On Tue, 5 Dec 2023, 12:19, Bowen Song via user 
 wrote:


Please rethink your use case. Creating and deleting tables
concurrently often leads to schema disagreement. Even doing so on a
single node sequentially will lead to a large number of tombstones
in the system tables.

On 04/12/2023 19:55, Sébastien Rebecchi wrote:

Thank you Dipan.

Do you know if there is a good reason for Cassandra to leave the
table folders behind even when there is no snapshot?

I'm thinking of use cases where there is a need to create and
delete small tables at a high rate. You could quickly end up with
more than 65K (the ext4 limit) subdirectories in the KS directory,
while 99.9..% of them are residues of deleted tables.

It looks quite dirty for Cassandra not to clean up its own
"garbage" by itself, and quite dangerous for the end user to have
to do it on their own, don't you think?

Thanks,

Sébastien.

On Mon, 4 Dec 2023, 11:28, Dipan Shah  wrote:

Hello Sebastien,

There are no inbuilt tools that will automatically remove
folders of deleted tables.

Thanks,

Dipan Shah


*From:* Sébastien Rebecchi 
*Sent:* 04 December 2023 13:54
*To:* user@cassandra.apache.org 
*Subject:* Remove folders of deleted tables
Hello,

When we delete a table with Cassandra, it leaves the folder of
that table on the file system, even if there is no snapshot (auto
snapshots disabled).
So we end up with the empty folder {data folder}/{keyspace
name}/{table name-table id} containing only 1 subfolder,
backups, which is itself empty.
Is there a way to automatically remove folders of deleted tables?

Sébastien.


Re: Remove folders of deleted tables

2023-12-05 Thread Bowen Song via user
Please rethink your use case. Creating and deleting tables concurrently 
often leads to schema disagreement. Even doing so on a single node 
sequentially will lead to a large number of tombstones in the system tables.


On 04/12/2023 19:55, Sébastien Rebecchi wrote:

Thank you Dipan.

Do you know if there is a good reason for Cassandra to leave the table 
folders behind even when there is no snapshot?


I'm thinking of use cases where there is a need to create and delete 
small tables at a high rate. You could quickly end up with more than 65K 
(the ext4 limit) subdirectories in the KS directory, while 99.9..% of 
them are residues of deleted tables.


It looks quite dirty for Cassandra not to clean up its own "garbage" 
by itself, and quite dangerous for the end user to have to do it 
on their own, don't you think?


Thanks,

Sébastien.

On Mon, 4 Dec 2023, 11:28, Dipan Shah  wrote:

Hello Sebastien,

There are no inbuilt tools that will automatically remove folders
of deleted tables.

Thanks,

Dipan Shah


*From:* Sébastien Rebecchi 
*Sent:* 04 December 2023 13:54
*To:* user@cassandra.apache.org 
*Subject:* Remove folders of deleted tables
Hello,

When we delete a table with Cassandra, it leaves the folder of that
table on the file system, even if there is no snapshot (auto snapshots
disabled).
So we end up with the empty folder {data folder}/{keyspace
name}/{table name-table id} containing only 1 subfolder, backups,
which is itself empty.
Is there a way to automatically remove folders of deleted tables?

Sébastien.


Re: Migrating to incremental repair in C* 4.x

2023-11-27 Thread Bowen Song via user

Hi Jeff,


Does subrange repair mark the SSTable as repaired? From my memory, it 
doesn't.



Regards,
Bowen


On 27/11/2023 16:47, Jeff Jirsa wrote:
I don’t work for datastax, thats not my blog, and I’m on a phone and 
potentially missing nuance, but I’d never try to convert a cluster to 
IR by disabling auto compaction . It sounds very much out of date or 
its optimized for fixing one node in a cluster somehow. It didn’t make 
sense in the 4.0 era.


Instead I’d leave compaction running and slowly run incremental repair 
across parts of the token range, slowing down as pending compactions 
increase


I’d choose token ranges such that you’d repair 5-10% of the data on 
each node at a time
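
For illustration only, a rough sketch of splitting the full Murmur3 token 
range into such slices (the keyspace name is made up, and whether subrange 
repairs mark SSTables as repaired is exactly the open question raised earlier 
in this thread):

    # Print one "nodetool repair" command per ~5% slice of the full token ring.
    MIN_TOKEN = -2**63
    MAX_TOKEN = 2**63 - 1
    SLICES = 20

    width = (MAX_TOKEN - MIN_TOKEN) // SLICES
    for i in range(SLICES):
        start = MIN_TOKEN + i * width
        end = MAX_TOKEN if i == SLICES - 1 else start + width
        print(f"nodetool repair -st {start} -et {end} my_keyspace")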




On Nov 23, 2023, at 11:31 PM, Sebastian Marsching 
 wrote:


 Hi,

we are currently in the process of migrating from C* 3.11 to C* 4.1 
and we want to start using incremental repairs after the upgrade has 
been completed. It seems like all the really bad bugs that made using 
incremental repairs dangerous in C* 3.x have been fixed in 4.x, and 
for our specific workload, incremental repairs should offer a 
significant performance improvement.


Therefore, I am currently devising a plan for how we could migrate to 
using incremental repairs. I am aware of the guide from DataStax 
(https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesMigration.html), 
but this guide is quite old and was written with C* 3.0 in mind, so I 
am not sure whether this still fully applies to C* 4.x.


In addition to that, I am not sure whether this approach fits our 
workload. In particular, I am wary about disabling autocompaction for 
an extended period of time (if you are interested in the reasons why, 
they are at the end of this e-mail).


Therefore, I am wondering whether a slightly different process might 
work better for us:


1. Run a full repair (we periodically run those anyway).
2. Mark all SSTables as repaired, even though they will include data 
that has not been repaired yet because it was added while the repair 
process was running.

3. Run another full repair.
4. Start using incremental repairs (and the occasional full repair in 
order to handle bit rot etc.).


If I understood the interactions between full repairs and incremental 
repairs correctly, step 3 should repair potential inconsistencies in 
the SSTables that were marked as repaired in step 2 while avoiding 
the problem of overstreaming that would happen when only marking 
those SSTables as repaired that already existed before step 1.


Does anyone see a flaw in this concept or has experience with a 
similar scenario (migrating to incremental repairs in an environment 
with high-density nodes, where a single table contains most of the data)?


I am also interested in hearing about potential problems other C* 
users experienced when migrating to incremental repairs, so that we 
get a better idea what to expect.


Thanks,
Sebastian


Here is the explanation why I am being cautious:

More than 95 percent of our data is stored in a single table, and we 
use high density nodes (storing about 3 TB of data per node). This 
means that a full repair for the whole cluster takes about a week.


The reason for this layout is that most of our data is “cold”, 
meaning that it is written once, never updated, and rarely deleted or 
read. However, new data is added continuously, so disabling 
autocompaction for the duration of a full repair would lead to a high 
number of small SSTables accumulating over the course of the week, 
and I am not sure how well the cluster would handle such a situation 
(and the increased load when autocompaction is enabled again).


Re: Memory and caches

2023-11-27 Thread Bowen Song via user

Hi Sebastien,


What's your goal? Improving cache hit rate purely for the sake of having 
a higher hit rate is rarely a good goal, because higher cache hit rate 
doesn't always mean faster operations.


Do you have specific issues with performance? If so, can you please tell 
us more about it? This way, we can focus on that.



Cheers,
Bowen

On 27/11/2023 14:59, Sébastien Rebecchi wrote:

Hello

When I use nodetool info, it prints the following relevant information:

Heap Memory (MB)       : 14229.31 / 32688.00
Off Heap Memory (MB)   : 5390.57
Key Cache              : entries 670423, size 100 MiB, capacity 100 
MiB, 13152259 hits, 47205855 requests, 0.279 recent hit rate, 14400 
save period in seconds
Chunk Cache            : entries 63488, size 992 MiB, capacity 992 
MiB, 143250511 misses, 162302465 requests, 0.117 recent hit rate, 
2497.557 microseconds miss latency


Here I focus on lines relevant for that conversation. And the numbers 
are roughly the same for all nodes of the cluster.
The key and chunk caches are full and the hit rate is low. At the same 
time the heap memory is far from being used at full capacity.
I would say that I can significantly increase the sizes of those 
caches in order to increase hit rate and improve performance.
In cassandra.yaml, key_cache_size_in_mb has a blank value, so 100 MiB 
by default, and file_cache_size_in_mb is set to 1024.
I'm thinking about setting key_cache_size_in_mb to 1024 
and file_cache_size_in_mb to 2048. What would you recommend? Does anyone 
have good experience with tuning those parameters?
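
For reference, the hit rates quoted above follow directly from the counters 
in the same output, so the same counters can be used to measure the effect of 
any cache size change:

    # Counters copied from the "nodetool info" output above.
    key_cache_hits, key_cache_requests = 13_152_259, 47_205_855
    chunk_cache_misses, chunk_cache_requests = 143_250_511, 162_302_465

    print(f"key cache hit rate:   {key_cache_hits / key_cache_requests:.3f}")            # ~0.279
    print(f"chunk cache hit rate: {1 - chunk_cache_misses / chunk_cache_requests:.3f}")  # ~0.117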


Thank you in advance.

Sébastien.


Re: Migrating to incremental repair in C* 4.x

2023-11-27 Thread Bowen Song via user

Hi Sebastian,

It's better to walk down the path on which others have walked before you 
and had great success, than a path that nobody has ever walked. For the 
former, you know it's relatively safe and it works. The same can hardly 
be said for the latter.


You said it takes a week to run the full repair for your entire cluster, 
not each node. Depending on the number of nodes in your cluster, each 
node should take significantly less time than that unless you have RF 
set to the total number of nodes. Keep in mind that you only need to 
disable the auto-compaction for the duration of a full repair on each 
node, not the whole cluster.


Now, you may ask, how do I know whether that is going to be an issue or not? 
That depends on a few factors, such as:


* how long does it take for each node to complete a full repair for that 
node
* how many SSTables currently exist on each node (try "find 
/var/lib/cassandra/data -name '*-Data.db' | wc -l")

* how frequently is the memtable getting flushed on each node
* what's the number of open file descriptors limit (see "cat 
/proc/[PID]/limits" and "sysctl fs.nr_open")


If the total number of SSTables (the existing ones, plus the number of 
memtable flushes that will happen while auto-compaction is turned off) is 
going to be significantly less than half of the open FD limit, you'll have 
nothing to worry about. Otherwise, you may want to consider temporarily 
increasing the open FD limit, reducing the memtable flush frequency (e.g. 
increase the memtable size or reduce the number of write requests), or 
reducing the existing number of SSTables (e.g. by compacting), or just take 
the risk and bet that Cassandra is not going to open all the SSTables 
at the same time (not recommended).


You may be wondering, why only half of the open FD limit? 
That's because Cassandra usually keeps both the *-Index.db and *-Data.db 
files open when an SSTable is in use.
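
A small sketch consolidating these checks (the data path and process lookup 
are assumptions for a typical package install; the find command listed above 
does the counting more thoroughly, including snapshots and backups):

    import glob
    import subprocess

    # Find the Cassandra process and read its soft "Max open files" limit.
    pid = subprocess.check_output(["pgrep", "-f", "CassandraDaemon"]).split()[0].decode()
    with open(f"/proc/{pid}/limits") as f:
        fd_limit = next(int(line.split()[3]) for line in f if line.startswith("Max open files"))

    # Count live *-Data.db files (keyspace/table/file directory layout).
    sstables = len(glob.glob("/var/lib/cassandra/data/*/*/*-Data.db"))

    budget = fd_limit // 2  # each open SSTable needs ~2 FDs (Index + Data)
    print(f"data files: {sstables}, FD limit: {fd_limit}, SSTable budget: {budget}")
    if sstables < budget // 2:  # "significantly less than" -> keep a 2x margin
        print("plenty of room for extra memtable flushes during the migration")
    else:
        print("consider raising the FD limit or compacting first")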


I hope that helps.

Regards,
Bowen

On 23/11/2023 23:30, Sebastian Marsching wrote:

Hi,

we are currently in the process of migrating from C* 3.11 to C* 4.1 
and we want to start using incremental repairs after the upgrade has 
been completed. It seems like all the really bad bugs that made using 
incremental repairs dangerous in C* 3.x have been fixed in 4.x, and 
for our specific workload, incremental repairs should offer a 
significant performance improvement.


Therefore, I am currently devising a plan for how we could migrate to 
using incremental repairs. I am aware of the guide from DataStax 
(https://docs.datastax.com/en/cassandra-oss/3.0/cassandra/operations/opsRepairNodesMigration.html), 
but this guide is quite old and was written with C* 3.0 in mind, so I 
am not sure whether this still fully applies to C* 4.x.


In addition to that, I am not sure whether this approach fits our 
workload. In particular, I am wary about disabling autocompaction for 
an extended period of time (if you are interested in the reasons why, 
they are at the end of this e-mail).


Therefore, I am wondering whether a slightly different process might 
work better for us:


1. Run a full repair (we periodically run those anyway).
2. Mark all SSTables as repaired, even though they will include data 
that has not been repaired yet because it was added while the repair 
process was running.

3. Run another full repair.
4. Start using incremental repairs (and the occasional full repair in 
order to handle bit rot etc.).


If I understood the interactions between full repairs and incremental 
repairs correctly, step 3 should repair potential inconsistencies in 
the SSTables that were marked as repaired in step 2 while avoiding the 
problem of overstreaming that would happen when only marking those 
SSTables as repaired that already existed before step 1.


Does anyone see a flaw in this concept or has experience with a 
similar scenario (migrating to incremental repairs in an environment 
with high-density nodes, where a single table contains most of the data)?


I am also interested in hearing about potential problems other C* 
users experienced when migrating to incremental repairs, so that we 
get a better idea what to expect.


Thanks,
Sebastian


Here is the explanation why I am being cautious:

More than 95 percent of our data is stored in a single table, and we 
use high density nodes (storing about 3 TB of data per node). This 
means that a full repair for the whole cluster takes about a week.


The reason for this layout is that most of our data is “cold”, meaning 
that it is written once, never updated, and rarely deleted or read. 
However, new data is added continuously, so disabling autocompaction 
for the duration of a full repair would lead to a high number of small 
SSTables accumulating over the course of the week, and I am not sure 
how well the cluster would handle such a situation (and the increased 
load when autocompaction is enabled again).


Re: Cassandra stopped responding but still alive

2023-11-01 Thread Bowen Song via user
What do you mean by saying "Cassandra stopped responding ... to nodetool 
requests"? Is it a specific nodetool command (e.g. "nodetool status") or 
all nodetool commands? What's the issue? Was it an error message, such 
as connection refused? Or freezes/unresponsive?


It's common to see Cassandra shutdown the gossip and native transport 
due to disk IO errors or data corruption with the default disk failure 
policy "stop", but that should not shutdown the JMX port used by 
nodetool. In "nodetool info" output, it will clearly say both gossip and 
native transport are not active if this is the case.


I can't help but notice that the data file, commit log, etc. directories 
are all under the "/data" directory, which makes me want to ask: is 
this directory on shared storage, such as NFS or a SAN? If that is the 
case, a storage failure may lead to multiple nodes stopping working.


In addition to the above, are you sure you are looking at the correct 
log files? The timestamp on the 3 log files you provided don't match. 
The last line of log in cassandra.log ended on 23 Oct, and the logs in 
the system.log are between 15:00 and 15:31 on 30 Oct, yet the first line 
of log in the gc.log started at 15:41. There's no overlapping time 
window between any of the log files.


On 31/10/2023 23:56, Ben Klein wrote:
On October 30, 2023, at approximately 15:38 UTC, Cassandra stopped 
responding to TCP pings on port 9042 and to nodetool requests. 
However, systemd reported that it was still online. The first node to 
fail was the seed node (192.168.0.44), followed within the next couple 
minutes by the other two (192.168.0.15 and 192.168.0.20). Looking 
through the logs on the first node, I did not see anything out of the 
ordinary. When the service was restarted (through systemd), it came up 
with no problem, but this is the second time this has happened in the 
last month.


I have attached all of the log files from the primary node. I have 
also attached the cassandra.yaml file, which is the same on all three 
nodes.


What could possibly be causing this? Is there anything else that I 
should be looking at?

Re: java driver with cassandra proxies (option: -Dcassandra.join_ring=false)

2023-10-12 Thread Bowen Song via user
I'm not 100% sure, but it's worth trying to disable the token metadata, 
because the driver needs to read the "system.peers_v2" table to 
populate the token metadata.


On 11/10/2023 19:15, Regis Le Bretonnic wrote:

Hi (also posted in dev mailing list but not sure I can publish on it),

We use the DataStax Cassandra Java driver v4.15.0 and we want to limit connections 
only to Cassandra proxy nodes (nodes with no data, started with the option 
-Dcassandra.join_ring=false).
For that:
  - we configured the driver to have only proxy hosts in the contact-points 
(datastax-java-driver.basic.contact-points).
  - we added a custom configuration containing "whitelisted host" (same list as 
contact-points)
  - we implemented a custom NodeFilter class to limit the allowed nodes to the 
whitelisted ones

If we look at the open TCP connections between the client and the Cassandra cluster, 
we see only 2:
  - one to one of the proxies listed in the contact-points (coordinator connection)
  - another one to the same proxy (query connection)

We expected to have an open connection to each proxy listed in the contact-points 
/ whitelisted hosts.
We found that this is not the case, because during cluster discovery the driver executes a 
query against the "system.peers" or "system.peers_v2" table (in the DefaultTopologyMonitor 
class), and proxy nodes are not in this table.

Why are proxy nodes not listed in system.peers, and why does the discovery check this table? 
Is it possible to bypass this check or add these nodes to the "peers" table?
Is there a way to implement a custom version of the TopologyMonitor interface to 
bypass this mechanism?
Is there another way to do this?

Thanks in advance
Regards

Re: [HELP] Cassandra 4.1.1 Repeated Bootstrapping Failure

2023-09-11 Thread Bowen Song via user

Hi Scott,


Thank you for pointing this out. I found it too, but I deemed it to be 
irrelevant for the following reasons:


 * it was fixed in 4.1.1, as you have correctly pointed out; and
 * the error message is slightly different, "writevAddresses" vs
   "writeAddress"; and
 * it actually got stuck for 15 minutes without any logs related to the
   streaming, but in my case everything worked fine up until it
   suddenly times out.

Therefore I did not mention it in the email.


Regards,

Bowen


On 11/09/2023 22:24, C. Scott Andreas wrote:

Bowen, thanks for reaching out.

My mind immediately jumped to a ticket which has very similar 
pathology: "CASSANDRA-18110 
<https://issues.apache.org/jira/browse/CASSANDRA-18110>: Streaming 
progress virtual table lock contention can trigger TCP_USER_TIMEOUT 
and fail streaming" -- but I see this was fixed in 4.1.1.


On Sep 11, 2023, at 2:09 PM, Bowen Song via user 
 wrote:



  *Description*

When adding a new node to an existing cluster, the new node 
bootstrapping fails with the 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out" error from the streaming source node. 
Resuming the bootstrap with "nodetool bootstrap resume" works, but 
the resumed bootstrap can fail too. We often need to run "nodetool 
bootstrap resume" a couple of times to complete the bootstrapping on 
a joining node.



  Steps that produced the error

(I'm hesitant to say "steps to reproduce", because I failed to 
reproduce the error on a testing cluster)
Install Cassandra 4.1.1 on new servers, using two of the existing 
nodes as seed nodes, start the new node and let it join the cluster. 
Watch the logs.



  Environment

All nodes, existing or new, have the same software versions as below.

Cassandra: version 4.1.1
Java: OpenJDK 11
OS: Debian 11

Existing nodes each have a 1TB SSD, 64GB of memory and 6 CPU cores, with 
num_tokens set to 4
New nodes each have a 2TB SSD, 128GB of memory and 16 CPU cores, with 
num_tokens set to 8


Cassandra is in a single DC, single rack setup with about 130 nodes, 
and all non-system keyspaces have RF=3


Relevant config options:

stream_throughput_outbound: 15MiB/s
  streaming_connections_per_host: 2
  auto_bootstrap: not set, default to true
  internode_tcp_user_timeout: not set, default to 30 seconds
  internode_streaming_tcp_user_timeout: not set, default to 5 minutes
  streaming_keep_alive_period: not set, default to 5 minutes
  streaming_state_expires: not set, default to 3 days
  streaming_state_size: not set, default to 40MiB
  streaming_stats_enabled: not set, default to true
  uuid_sstable_identifiers_enabled: true (turned on after
upgraded to 4.1 last year)


  What we have tried

*Tried*: checking the hardware and network
*Result*: everything appears to be fine

*Tried*: Google searching for the error message 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out"
*Result*: only one matching result was found, and it points to 
CASSANDRA-16143 
<https://issues.apache.org/jira/browse/CASSANDRA-16143>. That 
certainly doesn't apply in our case, as it was fixed in 4.0, and I 
also don't believe our data centre grade SSDs are that slow.


*Tried*: reducing the stream_throughput_outbound from 30 to 15 MiB/s
*Result*: did not help, no sign of any improvement

*Tried*: analyse the logs from the joining node and the streaming 
source nodes
*Result*: the error says the write connection timed out on the 
sending end, but a few seconds before that, both sending and 
receiving ends of the connection were still communicating with each 
other. I couldn't make sense of it.


*Tried*: bootstrapping a different node of the same spec
*Result*: same error reproduced

*Tried*: attempting to reproduce the error on a testing cluster
*Result*: unable to reproduce this error on a smaller testing cluster 
with fewer nodes, less powerful hardware, the same Cassandra, Java and OS 
versions, the same config, the same schema, less data and the same mix of 
vnode counts.


*Tried*: keep retrying with "nodetool bootstrap resume"
*Result*: this works and unblocked us from adding new nodes to the 
cluster, but this obviously is not how it should be done.



  What do I expect from posting this

I suspect that this is a bug in Cassandra, but I lack the evidence 
to support that, and I lack the expertise to debug Cassandra (or 
any other Java application).
It would be much appreciated if anyone could offer me some help on 
this, or point me to a direction that may lead to the solution.



  Relevant logs

Note: IP addresses, keyspace and table names are redacted. The IP 
address ending in 111 is the joining node, and the IP address ending 
in 182 was one of the streaming source nodes.


The logs from the joining node (IP: xxx.xxx

[HELP] Cassandra 4.1.1 Repeated Bootstrapping Failure

2023-09-11 Thread Bowen Song via user


 *Description*

When adding a new node to an existing cluster, the new node 
bootstrapping fails with the 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out" error from the streaming source node. 
Resuming the bootstrap with "nodetool bootstrap resume" works, but the 
resumed bootstrap can fail too. We often need to run "nodetool bootstrap 
resume" a couple of times to complete the bootstrapping on a joining node.



 Steps that produced the error

(I'm hesitant to say "steps to reproduce", because I failed to reproduce 
the error on a testing cluster)
Install Cassandra 4.1.1 on new servers, using two of the existing nodes 
as seed nodes, start the new node and let it join the cluster. Watch the 
logs.



 Environment

All nodes, existing or new, have the same software versions as below.

   Cassandra: version 4.1.1
   Java: OpenJDK 11
   OS: Debian 11

Existing nodes each have a 1TB SSD, 64GB of memory and 6 CPU cores, with 
num_tokens set to 4
New nodes each have a 2TB SSD, 128GB of memory and 16 CPU cores, with 
num_tokens set to 8


Cassandra is in a single DC, single rack setup with about 130 nodes, and 
all non-system keyspaces have RF=3


Relevant config options:

  stream_throughput_outbound: 15MiB/s
  streaming_connections_per_host: 2
  auto_bootstrap: not set, default to true
  internode_tcp_user_timeout: not set, default to 30 seconds
  internode_streaming_tcp_user_timeout: not set, default to 5 minutes
  streaming_keep_alive_period: not set, default to 5 minutes
  streaming_state_expires: not set, default to 3 days
  streaming_state_size: not set, default to 40MiB
  streaming_stats_enabled: not set, default to true
  uuid_sstable_identifiers_enabled: true (turned on after upgraded
   to 4.1 last year)


 What we have tried

*Tried*: checking the hardware and network
*Result*: everything appears to be fine

*Tried*: Google searching for the error message 
"io.netty.channel.unix.Errors$NativeIoException: writeAddress(..) 
failed: Connection timed out"
*Result*: only one matching result was found, and it points to 
CASSANDRA-16143 . 
That certainly doesn't apply in our case, as it was fixed in 4.0, and I 
also don't believe our data centre grade SSDs are that slow.


*Tried*: reducing the stream_throughput_outbound from 30 to 15 MiB/s
*Result*: did not help, no sign of any improvement

*Tried*: analyse the logs from the joining node and the streaming source 
nodes
*Result*: the error says the write connection timed out on the sending 
end, but a few seconds before that, both sending and receiving ends of 
the connection were still communicating with each other. I couldn't make 
sense of it.


*Tried*: bootstrapping a different node of the same spec
*Result*: same error reproduced

*Tried*: attempting to reproduce the error on a testing cluster
*Result*: unable to reproduce this error on a smaller testing cluster 
with fewer nodes, less powerful hardware, the same Cassandra, Java and OS 
versions, the same config, the same schema, less data and the same mix of 
vnode counts.


*Tried*: keep retrying with "nodetool bootstrap resume"
*Result*: this works and unblocked us from adding new nodes to the 
cluster, but this obviously is not how it should be done.



 What do I expect from posting this

I suspect that this is a bug in Cassandra, but I lack the evidence to 
support that, and I lack the expertise to debug Cassandra (or any 
other Java application).
It would be much appreciated if anyone could offer me some help on this, 
or point me to a direction that may lead to the solution.



 Relevant logs

Note: IP addresses, keyspace and table names are redacted. The IP address 
ending in 111 is the joining node, and the IP address ending in 182 was 
one of the streaming source nodes.


The logs from the joining node (IP: xxx.xxx.xxx.111):

   DEBUG [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,555 StreamDeserializingTask.java:74 - [Stream
   #69de5e80-4f21-11ee-abc5-1de0bb481b0e channel: e0e09450] Received
   Prepare SYNACK ( 440 files}
   INFO  [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,556 StreamResultFuture.java:187 - [Stream
   #69de5e80-4f21-11ee-abc5-1de0bb481b0e ID#0] Prepare completed.
   Receiving 440 files(38.941GiB), sending 0 files(0.000KiB)
   DEBUG [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,556 StreamCoordinator.java:148 - Connecting next
   session 69de5e80-4f21-11ee-abc5-1de0bb481b0e with /95.217.36.91:7000.
   INFO  [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,556 StreamSession.java:368 - [Stream
   #69de5e80-4f21-11ee-abc5-1de0bb481b0e] Starting streaming to
   95.217.36.91:7000
   DEBUG [Stream-Deserializer-/xxx.xxx.xxx.182:7000-e0e09450]
   2023-09-09 15:59:13,556 StreamingMultiplexedChannel.java:167 -
   

Re: Big Data Question

2023-08-17 Thread Bowen Song via user
I don't have experience with Cassandra on Kubernetes, so I can't comment 
on that.


For repairs, may I interest you with incremental repairs? It will make 
repairs hell of a lot faster. Of course, occasional full repair is still 
needed, but that's another story.



On 17/08/2023 21:36, Joe Obernberger wrote:

Thank you.  Enjoying this conversation.
Agree on blade servers, where each blade has a small number of SSDs.  
Yeh/Nah to a kubernetes approach assuming fast persistent storage?  I 
think that might be easier to manage.


In my current benchmarks, the performance is excellent, but the 
repairs are painful.  I come from the Hadoop world where it was all 
about large servers with lots of disk.
Relatively small number of tables, but some have a high number of 
rows, 10bil + - we use spark to run across all the data.


-Joe

On 8/17/2023 12:13 PM, Bowen Song via user wrote:
The optimal node size largely depends on the table schema and 
read/write pattern. In some cases 500 GB per node is too large, but 
in some other cases 10TB per node works totally fine. It's hard to 
estimate that without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade 
servers than single server with gigantic memory and disk sizes.



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2U servers can be had with 24 3.8 TByte SSDs; so 
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24G of heap per instance, so a 
server with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB 
disk each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of 
storage space, minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write 
access pattern, there's a disk space amplification to consider. For 
example, with STCS, the disk usage can be many times of the actual 
live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed 
among all nodes, and you need to take that into consideration and 
size the nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you 
(a lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration 
layer to handle those nodes be a viable approach? Perhaps the 
worker nodes would have enough RAM to run 4 instances (pods) of 
Cassandra, you would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running 
on that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How 
is that handled 'in the real world'?  With seed nodes, how many 
would you have in such a configuration?

Thanks for any thoughts!

-Joe








Re: Big Data Question

2023-08-17 Thread Bowen Song via user
From my experience, that's not entirely true. For large nodes, the 
bottleneck is usually the JVM garbage collector. The GC pauses can 
easily get out of control on very large heaps, and long STW pauses may 
also result in nodes flipping up and down from other nodes' perspective, 
which often renders the entire cluster unstable.


Using RF=1 is also strongly discouraged, even with reliable and durable 
storage. By going with RF=1, you not only lose data replication, 
but also high availability. If any node becomes unavailable in the 
cluster, it will render the entire token range(s) owned by that node 
inaccessible, causing (some or all) CQL queries to fail. This means many 
routine maintenance tasks, such as upgrading and restarting nodes, are 
going to introduce downtime for the cluster. To ensure strong 
consistency and HA, RF=3 is recommended.



On 17/08/2023 20:40, daemeon reiydelle wrote:
A lot of (actually all) of the replies seem to be based on local nodes with 1Gb 
networks of spinning rust. Much of what is mentioned below is TOTALLY 
wrong for cloud. So clarify whether you are "real world" or rusty slow 
data center world (definitely not modern DC either).


E.g. should not handle more than 2tb of ACTIVE disk, and that was for 
spinning rust with maybe 1gb networks. 10tb of modern high speed SSD 
is more typical with 10 or 40gb networks. If data is persisted to 
cloud storage, replication should be 1, vm's fail over to new 
hardware. Obviously if your storage is ephemeral, you have a different 
discussion. More of a monologue with an idiot in Finance, but 

/./
/Arthur C. Clarke famously said that "technology sufficiently advanced 
is indistinguishable from magic." Magic is coming, and it's coming for 
all of us/

/
/
*Daemeon Reiydelle*
*email: daeme...@gmail.com*
*LI: https://www.linkedin.com/in/daemeonreiydelle/*
*San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle*


On Thu, Aug 17, 2023 at 6:13 AM Bowen Song via user 
 wrote:


Just pointing out the obvious, for 1PB of data on nodes with 2TB disk
each, you will need far more than 500 nodes.

1, it is unwise to run Cassandra with replication factor 1. It
usually
makes sense to use RF=3, so 1PB data will cost 3PB of storage space,
minimal of 1500 such nodes.

2, depending on the compaction strategy you use and the write access
pattern, there's a disk space amplification to consider. For example,
with STCS, the disk usage can be many times of the actual live
data size.

3, you will need some extra free disk space as temporary space for
running compactions.

4, the data is rarely going to be perfectly evenly distributed
among all
nodes, and you need to take that into consideration and size the
nodes
based on the node with the most data.

5, enough of bad news, here's a good one. Compression will save
you (a
lot) of disk space!

With all the above considered, you probably will end up with a lot
more
than the 500 nodes you initially thought. Your choice of compaction
strategy and compression ratio can dramatically affect this
calculation.


On 16/08/2023 16:33, Joe Obernberger wrote:
> General question on how to configure Cassandra.  Say I have
1PByte of
> data to store.  The general rule of thumb is that each node (or at
> least instance of Cassandra) shouldn't handle more than 2TBytes of
> disk.  That means 500 instances of Cassandra.
>
> Assuming you have very fast persistent storage (such as a NetApp,
> PorterWorx etc.), would using Kubernetes or some orchestration
layer
> to handle those nodes be a viable approach?  Perhaps the worker
nodes
> would have enough RAM to run 4 instances (pods) of Cassandra, you
> would need 125 servers.
> Another approach is to build your servers with 5 (or more) SSD
devices
> - one for OS, four for each instance of Cassandra running on that
> server.  Then build some scripts/ansible/puppet that would manage
> Cassandra start/stops, and other maintenance items.
>
> Where I think this runs into problems is with repairs, or
> sstablescrubs that can take days to run on a single instance. 
How is
> that handled 'in the real world'?  With seed nodes, how many
would you
> have in such a configuration?
> Thanks for any thoughts!
>
> -Joe
>
>


Re: Big Data Question

2023-08-17 Thread Bowen Song via user
The optimal node size largely depends on the table schema and read/write 
pattern. In some cases 500 GB per node is too large, but in some other 
cases 10TB per node works totally fine. It's hard to estimate that 
without benchmarking.


Again, just pointing out the obvious, you did not count the off-heap 
memory and page cache. 1TB of RAM for 24GB heap * 40 instances is 
definitely not enough. You'll most likely need between 1.5 and 2 TB 
memory for 40x 24GB heap nodes. You may be better off with blade servers 
than single server with gigantic memory and disk sizes.
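
Back-of-the-envelope arithmetic for that memory point (the per-instance 
off-heap and page cache figures below are illustrative assumptions, not 
measurements):

    instances = 40
    heap_gb = 24
    offheap_gb = 8        # bloom filters, index summaries, compression metadata, buffers, ...
    page_cache_gb = 200   # leave the OS something to cache hot SSTable chunks with

    total_gb = instances * (heap_gb + offheap_gb) + page_cache_gb
    print(f"rough minimum RAM: {total_gb} GB")  # ~1.5 TB before any safety margin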



On 17/08/2023 15:46, Joe Obernberger wrote:

Thanks for this - yeah - duh - forgot about replication in my example!
So - is 2TBytes per Cassandra instance advisable?  Better to use 
more/less?  Modern 2U servers can be had with 24 3.8 TByte SSDs; so 
assume 80Tbytes per server, you could do:
(1024*3)/80 = 39 servers, but you'd have to run 40 instances of 
Cassandra on each server; maybe 24G of heap per instance, so a server 
with 1TByte of RAM would work.

Is this what folks would do?

-Joe

On 8/17/2023 9:13 AM, Bowen Song via user wrote:
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk 
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It 
usually makes sense to use RF=3, so 1PB data will cost 3PB of storage 
space, minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For example, 
with STCS, the disk usage can be many times of the actual live data 
size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among 
all nodes, and you need to take that into consideration and size the 
nodes based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you 
(a lot) of disk space!


With all the above considered, you probably will end up with a lot 
more than the 500 nodes you initially thought. Your choice of 
compaction strategy and compression ratio can dramatically affect 
this calculation.



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte 
of data to store.  The general rule of thumb is that each node (or 
at least instance of Cassandra) shouldn't handle more than 2TBytes 
of disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach? Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD 
devices - one for OS, four for each instance of Cassandra running on 
that server.  Then build some scripts/ansible/puppet that would 
manage Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance. How is 
that handled 'in the real world'?  With seed nodes, how many would 
you have in such a configuration?

Thanks for any thoughts!

-Joe






Re: Big Data Question

2023-08-17 Thread Bowen Song via user
Just pointing out the obvious, for 1PB of data on nodes with 2TB disk 
each, you will need far more than 500 nodes.


1, it is unwise to run Cassandra with replication factor 1. It usually 
makes sense to use RF=3, so 1PB data will cost 3PB of storage space, 
minimal of 1500 such nodes.


2, depending on the compaction strategy you use and the write access 
pattern, there's a disk space amplification to consider. For example, 
with STCS, the disk usage can be many times of the actual live data size.


3, you will need some extra free disk space as temporary space for 
running compactions.


4, the data is rarely going to be perfectly evenly distributed among all 
nodes, and you need to take that into consideration and size the nodes 
based on the node with the most data.


5, enough of bad news, here's a good one. Compression will save you (a 
lot) of disk space!


With all the above considered, you probably will end up with a lot more 
than the 500 nodes you initially thought. Your choice of compaction 
strategy and compression ratio can dramatically affect this calculation.
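
A back-of-the-envelope sketch of how these five points stack up (every ratio 
below is an illustrative assumption; plug in numbers measured from your own 
tables):

    raw_data_tb = 1024          # 1 PB of live data before replication
    rf = 3                      # point 1: replication factor
    space_amplification = 1.5   # point 2: depends on compaction strategy (STCS can be far worse)
    compaction_headroom = 1.5   # point 3: temporary space for running compactions
    imbalance = 1.2             # point 4: size for the most-loaded node
    compression_ratio = 0.4     # point 5: on-disk size / uncompressed size

    disk_per_node_tb = 2
    total_tb = (raw_data_tb * rf * space_amplification * compaction_headroom
                * imbalance * compression_ratio)
    print(f"~{total_tb:.0f} TB of disk -> ~{total_tb / disk_per_node_tb:.0f} nodes at 2 TB each")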



On 16/08/2023 16:33, Joe Obernberger wrote:
General question on how to configure Cassandra.  Say I have 1PByte of 
data to store.  The general rule of thumb is that each node (or at 
least instance of Cassandra) shouldn't handle more than 2TBytes of 
disk.  That means 500 instances of Cassandra.


Assuming you have very fast persistent storage (such as a NetApp, 
PorterWorx etc.), would using Kubernetes or some orchestration layer 
to handle those nodes be a viable approach?  Perhaps the worker nodes 
would have enough RAM to run 4 instances (pods) of Cassandra, you 
would need 125 servers.
Another approach is to build your servers with 5 (or more) SSD devices 
- one for OS, four for each instance of Cassandra running on that 
server.  Then build some scripts/ansible/puppet that would manage 
Cassandra start/stops, and other maintenance items.


Where I think this runs into problems is with repairs, or 
sstablescrubs that can take days to run on a single instance.  How is 
that handled 'in the real world'?  With seed nodes, how many would you 
have in such a configuration?

Thanks for any thoughts!

-Joe




Re: 2 nodes marked as '?N' in 5 node cluster

2023-08-17 Thread Bowen Song via user
The first thing to look at is the logs, specifically the 
/var/log/cassandra/system.log file on each node.


A 5 second time drift is enough to cause Cassandra to fail. You should 
keep the time difference between Cassandra nodes very low by ensuring that 
time sync is working correctly, otherwise cross-node timeouts may happen, 
and a node whose clock is slightly behind may think everything is 
fine while a node whose clock is slightly ahead will think the 
other nodes are down.



On 08/08/2023 03:54, vishnu vanced wrote:

Hi All,

I am very new to Cassandra. I have a 5-node cluster set up on CentOS 
servers for our internal team's testing. A couple of days ago our network 
team asked us to stop 3 of the nodes, let's say C1, C2 and C3, for OS 
patching. After the activity I started the nodes again, but now, 
interestingly, node C1 was showing C2 as down and node C2 was showing C1 
as down, while on the remaining three nodes everything is UN. I have 
tried disabling gossip and enabling it, and restarting all the nodes; 
nothing changed. So I stopped this cluster and tried to build it freshly, 
but C1 and C2 only join the cluster if the other node is not present. I 
first added C1 to the cluster, and C2 only joins when I mention it as a 
seed node. Now in C1's nodetool status C2 is showing as '?N' and 
vice-versa, but the other nodes show everything as 'UN'. I have checked 
connectivity between all the servers and everything is fine. NTP in the 
three stopped servers differs by 5 secs, could that be the problem? But 
node C3 is not showing any issues.


Due to this, while creating schemas we are getting errors like schema 
version mismatch, and repairs are failing. Can anyone suggest how this 
can be fixed? Thanks!


P.S. Are there any Telegram/WhatsApp groups for Cassandra?

Regards
Vishnu


Re: Survey about the parsing of the tooling's output

2023-07-10 Thread Bowen Song via user
We parse the output of the following nodetool sub-commands in our custom 
scripts:


 * status
 * netstats
 * tpstats
 * ring

We don't mind the output format changing between major releases as long as 
all of the following are true:


1. major releases are not too frequent
   e.g. no more frequent than once every couple of years
2. the changes are clearly documented in the CHANGES.txt and mentioned
   in the NEWS.txt
   e.g. clearly specify that "someStatistic:" in "nodetool somecommand"
   is renamed to "Some Statistic:"
3. the functionality is not lost
   e.g. remove a value from the output with no obvious alternative
4. it doesn't become a lot harder to parse
   e.g. split a value into multiple values with different units, and
   the new values need to be added up together to get the original one

We have Ansible playbooks, shell scripts, Python scripts, etc. parsing 
the output, and to my best knowledge, all of them are trivial to rework 
for minor cosmetic changes like the one given in the example.
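
As a concrete (and deliberately simplified) illustration of the kind of 
parsing being discussed here, not our actual scripts:

    import subprocess

    def node_states():
        # Parse "nodetool status" data rows into a list of {state, address} dicts.
        out = subprocess.check_output(["nodetool", "status"], text=True)
        nodes = []
        for line in out.splitlines():
            fields = line.split()
            # Data rows start with a two-letter state code such as UN, DN, UJ or DL.
            if fields and len(fields[0]) == 2 and fields[0][0] in "UD":
                nodes.append({"state": fields[0], "address": fields[1]})
        return nodes

    down = [n for n in node_states() if not n["state"].startswith("U")]
    print(f"{len(down)} node(s) not up: {down}")

Cosmetic changes to the column layout or the state codes would break exactly 
this kind of script, which is why documenting such changes clearly in 
CHANGES.txt and NEWS.txt matters to us.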


Parsing JSON or YAML in vanilla POSIX shell (i.e. without tools such as 
jq installed) can be much harder, and we would rather not have to deal 
with that. For Ansible and Python script, it's a nonissue, but given the 
fact that we are already parsing the default output and it works fine, 
we are unlikely to change them to use JSON or YAML instead, unless the 
pain of dealing with breaking changes is too much and too often.


Querying via CQL is harder, and we would rather not do that for the 
reasons below:


 * it requires Cassandra credentials, instead the credential-less
   nodetool command on localhost
 * for shell scripts, the cqlsh command output is harder to parse than
   the nodetool command, because its output is a human-friendly table
   with header, dynamic indentations, field separators, etc., which
   makes it a less attractive candidate than the nodetool
 * for Ansible and Python scripts, using the CQL interface will require
   extra modules/libraries. The extra installation steps required make
   the scripts themselves less portable between different
   servers/environment, so we may still prefer the more portable
   nodetool approach where the localhost access is possible


On 10/07/2023 10:35, Miklosovic, Stefan wrote:

Hi Cassandra users,

I am a Cassandra developer, and we in the Cassandra project would love to know if 
there are users out there for whom the output of the tooling, like nodetool, 
is important when it comes to parsing it.

We are evaluating the consequences of changing nodetool's output for various 
commands. We are not completely sure whether users are parsing this output in 
their custom scripts, in which case changing the output would break those 
scripts.

Additionally, how big of a problem would an output change be if it only happened 
between major Cassandra versions, e.g. 4.0 -> 5.0 or 5.0 -> 6.0? In other words, 
there would be a guarantee that no breaking changes would ever occur in minor 
versions, only in majors.

Is there somebody out there who is relying on the output of some particular nodetool commands 
(or any command in tools/bin) in production? How often do you rely on parsing nodetool's 
output, and how much work would it be for you to accommodate some minor changes? For example, 
if the tool currently prints "someStatistic: 10" and we reworked it to "Some Statistic: 10".

Would you be OK if the output changed but you had a way to get e.g. JSON or YAML 
output instead, via some flag on the nodetool command, so that the default output 
format would be irrelevant?

We would appreciate it a lot if you gave us more feedback on this. I 
understand that not all questions apply to everyone.

Even if you are not relying on parsing the tooling's output in custom scripts, 
please tell us so. We are progressively trying to provide a CQL way to query 
the internal state of Cassandra, via virtual tables, for example.

Regards

Stefan Miklosovic

Re: 4.0 upgrade

2023-07-09 Thread Bowen Song via user
You should not make DDL (e.g. TRUNCATE, ALTER TABLE) or DCL (e.g. GRANT, 
ALTER ROLE) operations or run repair on a mixed version cluster. Source: 
https://www.datastax.com/learn/whats-new-for-cassandra-4/migrating-cassandra-4x


You should also ensure the gc_grace_seconds value is large enough to 
allow for the time to upgrade DC1, wait, upgrade DC2 and then complete a 
repair, or you may end up with resurrected data.
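
A small sketch of checking this ahead of time (the contact point is made up; 
note that any ALTER TABLE to raise gc_grace_seconds should be done before the 
first node is upgraded, since DDL on a mixed version cluster is discouraged, 
as noted above):

    from cassandra.cluster import Cluster

    session = Cluster(["10.0.0.1"]).connect()
    rows = session.execute(
        "SELECT keyspace_name, table_name, gc_grace_seconds FROM system_schema.tables"
    )
    window = 14 * 24 * 3600  # planned upgrade duration plus a post-upgrade repair
    for r in rows:
        if not r.keyspace_name.startswith("system") and r.gc_grace_seconds < window:
            print(f"{r.keyspace_name}.{r.table_name}: gc_grace_seconds={r.gc_grace_seconds} < {window}")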


You also must ensure you do not enable any new features on new version 
nodes in a mixed version cluster. You may enable new features after all 
nodes in the cluster are upgraded.


On 07/07/2023 20:50, Runtian Liu wrote:

Hi,

We are upgrading our Cassandra clusters from 3.0.27 to 4.0.6 and we 
observed some error related to repair: j.l.IllegalArgumentException: 
Unknown verb id 32


We have two datacenters for each Cassandra cluster and when we are 
doing an upgrade, we want to upgrade 1 datacenter first and monitor 
the upgrade datacenter for some time (1 week) to make sure there is no 
issue, then we will upgrade the second datacenter for that cluster.


We have some automated repair jobs running, is it expected to have 
repair stuck if we have 1 datacenter on 4.0 and 1 datacenter on 3.0?


Do you have any suggestions on how we should do the upgrade, is 
waiting for 1 week between two datacenters too long?


Thanks,
Runtian



Re: Upgrade from 3.11.5 to 4.1.x

2023-07-09 Thread Bowen Song via user
Assuming "do it in one go" means a rolling upgrade from 3.11.5 to 4.1.2 
skipping all version numbers between these two, the answer is yes, you 
can "do it in one go".


On 08/07/2023 01:14, Surbhi Gupta wrote:

Hi,

We have to upgrade from 3.11.5 to 4.1.x .
Can we do it in one go ?
Or do we have to go to an intermediate version first?

Thanks
Surbhi


Re: Issue while node addition on cassandra 4.0.7

2023-06-29 Thread Bowen Song via user
Talking about telnet, a closer look at the sequence 'FF F4 FF FD' makes 
me think of telnet command codes. Based on RFC 854 
<https://www.rfc-editor.org/rfc/rfc854.html>, the sequence is IAC, 
Interrupt Process, IAC, DO, which is basically what the key sequence 
'ctrl-c' sends in telnet.


On 29/06/2023 12:42, Bowen Song wrote:


Did anyone connect to the servers' storage port via telnet, nc 
(netcat) or something similar? 218762506 is 0x0D0A0D0A, which is two 
CRLF line endings.



On 29/06/2023 11:49, MyWorld wrote:

When we checked on the source nodes, we got similar errors.

Forgot to mention, we also received the below error message:
ERROR [Messaging-EventLoop-3-3] 2023-06-27 18:57:09,128 
InboundConnectionInitiator.java:360 - Failed to properly handshake 
with peer /10.127.2.10:58490 <http://10.127.2.10:58490>. Closing the 
channel.
io.netty.handler.codec.DecoderException: 
org.apache.cassandra.net.Message$InvalidLegacyProtocolMagic: Read 
218762506, Expected -900387334


On Thu, Jun 29, 2023 at 2:57 PM Bowen Song via user 
 wrote:


The expected value "-900387334" is the signed 32-bit decimal
representation of the PROTOCOL_MAGIC value 0xCA552DFA defined in
the net/Message.java

<https://github.com/apache/cassandra/blob/c579faa488ec156a59ed8e15dd6db55759b9c942/src/java/org/apache/cassandra/net/Message.java#L393>
file.

The read value "-720899" converts to 0xFFF4FFFD in hex. That's
not a valid TLS header, which should start with 0x16, so I don't
think it has anything to do with the server encryption related
options. It also does not look like a valid version number from
pre-4.0 Cassandra, so we can rule that out too. Since it's neither
a valid Cassandra 4.0+ magic, a TLS header, nor a pre-4.0 version
number, I have reason to believe the connection was not initiated
by another Cassandra server for inter-node communication, but
by another program. Can you follow the source IP and port
number back to the originating host and find out what that
program is? Or it may indeed have been one of the servers in the
cluster, not something else, which would indicate a
misconfiguration of the firewall rules.


On 29/06/2023 01:26, MyWorld wrote:

Hi all,
We are currently using Apache cassandra 4.0.7 in our
environment. While adding a new node in the existing 3-node DC,
we found below error.
This error is observed multiple times when the node was in the
UJ (up and joining) state.

Our current server-to-server internode encryption settings are
default.
server_encryption_options:
    internode_encryption: none
    enable_legacy_ssl_storage_port: false
    require_client_auth: false
    require_endpoint_verification: false

Please help to debug the root cause of this error.
Is it a point to worry about or is it just a Warning issue?
Also, our API properties have received a few 5xx messages
"Operation timed out. received only 2 responses" during this
time(addition of new node), which we have not received when we
were on the 3.11.x version. What could be the possible reason?
However things are stable once the node comes to the UN state.

ERROR [Messaging-EventLoop-3-10] 2023-06-27 18:37:14,931
InboundConnectionInitiator.java:360 - Failed to properly
handshake with peer /x.x.x.x:35894. Closing the channel.
io.netty.handler.codec.DecoderException:
org.apache.cassandra.net.Message$InvalidLegacyProtocolMagic:
Read -720899, Expected -900387334
        at

io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:478)
        at

io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at

io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at

io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at

io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at

io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
        at
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
        at
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at

io.netty

Re: Issue while node addition on cassandra 4.0.7

2023-06-29 Thread Bowen Song via user
Did anyone connect to the servers' storage port via telnet, nc 
(netcat) or something similar? 218762506 is 0x0D0A0D0A, which is two 
newlines.



On 29/06/2023 11:49, MyWorld wrote:

When we checked the source nodes, we got similar errors.

Forgot to mention, we also received the below error message:
ERROR [Messaging-EventLoop-3-3] 2023-06-27 18:57:09,128 
InboundConnectionInitiator.java:360 - Failed to properly handshake 
with peer /10.127.2.10:58490 <http://10.127.2.10:58490>. Closing the 
channel.
io.netty.handler.codec.DecoderException: 
org.apache.cassandra.net.Message$InvalidLegacyProtocolMagic: Read 
218762506, Expected -900387334


On Thu, Jun 29, 2023 at 2:57 PM Bowen Song via user 
 wrote:


The expected value "-900387334" is the little endian decimal
representation of the PROTOCOL_MAGIC value 0xCA552DFA defined in
the net/Message.java

<https://github.com/apache/cassandra/blob/c579faa488ec156a59ed8e15dd6db55759b9c942/src/java/org/apache/cassandra/net/Message.java#L393>
file.

The read value "-720899" converted to hex is 0xFFF4FFFD. That's not
a valid TLS header, which should start with 0x16, so I don't think it
has anything to do with the server encryption related options. It
also does not look like a valid version number from pre-4.0
Cassandra, so we can rule that out too. Since it's neither a valid
Cassandra 4.0+ magic, a TLS header nor a pre-4.0 version number, I
have reason to believe the connection was not initiated by another
Cassandra server for inter-node communication, but from another
program. Can you follow the source IP and port number back to the
originating host and find out what that program is, or whether it
was indeed one of the servers in the cluster rather than something
else, which could indicate a misconfiguration of the firewall rules?


On 29/06/2023 01:26, MyWorld wrote:

Hi all,
We are currently using Apache cassandra 4.0.7 in our environment.
While adding a new node in the existing 3-node DC, we found below
error.
This error is observed multiple times when the node was in the UJ
(up and joining) state.

Our current server-to-server internode encryption settings are
default.
server_encryption_options:
    internode_encryption: none
    enable_legacy_ssl_storage_port: false
    require_client_auth: false
    require_endpoint_verification: false

Please help to debug the root cause of this error.
Is it a point to worry about or is it just a Warning issue?
Also, our API properties have received a few 5xx messages
"Operation timed out. received only 2 responses" during this
time(addition of new node), which we have not received when we
were on the 3.11.x version. What could be the possible reason?
However things are stable once the node comes to the UN state.

ERROR [Messaging-EventLoop-3-10] 2023-06-27 18:37:14,931
InboundConnectionInitiator.java:360 - Failed to properly
handshake with peer /x.x.x.x:35894. Closing the channel.
io.netty.handler.codec.DecoderException:
org.apache.cassandra.net.Message$InvalidLegacyProtocolMagic: Read
-720899, Expected -900387334
        at

io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:478)
        at

io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at

io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at

io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at

io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at

io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at

io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
        at
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
        at
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at

io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at

io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
        at java.base/java.l

Re: Issue while node addition on cassandra 4.0.7

2023-06-29 Thread Bowen Song via user
The expected value "-900387334" is the little endian decimal 
representation of the PROTOCOL_MAGIC value 0xCA552DFA defined in the 
net/Message.java 
<https://github.com/apache/cassandra/blob/c579faa488ec156a59ed8e15dd6db55759b9c942/src/java/org/apache/cassandra/net/Message.java#L393> 
file.


The read value "-720899" converted to hex is 0xFFF4FFFD. That's not a 
valid TLS header, which should start with 0x16, so I don't think it has 
anything to do with the server encryption related options. It also does 
not look like a valid version number from pre-4.0 Cassandra, so we can 
rule that out too. Since it's neither a valid Cassandra 4.0+ magic, a TLS 
header nor a pre-4.0 version number, I have reason to believe the 
connection was not initiated by another Cassandra server for inter-node 
communication, but from another program. Can you follow the source IP 
and port number back to the originating host and find out what that 
program is, or whether it was indeed one of the servers in the cluster 
rather than something else, which could indicate a misconfiguration of 
the firewall rules?



On 29/06/2023 01:26, MyWorld wrote:

Hi all,
We are currently using Apache cassandra 4.0.7 in our environment. 
While adding a new node in the existing 3-node DC, we found below error.
This error is observed multiple times when the node was in the UJ (up 
and joining) state.


Our current server-to-server internode encryption settings are default.
server_encryption_options:
    internode_encryption: none
    enable_legacy_ssl_storage_port: false
    require_client_auth: false
    require_endpoint_verification: false

Please help to debug the root cause of this error.
Is it a point to worry about or is it just a Warning issue?
Also, our API properties have received a few 5xx messages "Operation 
timed out. received only 2 responses" during this time(addition of new 
node), which we have not received when we were on the 3.11.x version. 
What could be the possible reason?

However things are stable once the node comes to the UN state.

ERROR [Messaging-EventLoop-3-10] 2023-06-27 18:37:14,931 
InboundConnectionInitiator.java:360 - Failed to properly handshake 
with peer /x.x.x.x:35894. Closing the channel.
io.netty.handler.codec.DecoderException: 
org.apache.cassandra.net.Message$InvalidLegacyProtocolMagic: Read 
-720899, Expected -900387334
        at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:478)
        at 
io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at 
io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357)
        at 
io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379)
        at 
io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365)
        at 
io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
        at 
io.netty.channel.epoll.AbstractEpollStreamChannel$EpollStreamUnsafe.epollInReady(AbstractEpollStreamChannel.java:795)
        at 
io.netty.channel.epoll.EpollEventLoop.processReady(EpollEventLoop.java:480)
        at 
io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:378)
        at 
io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
        at 
io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
        at 
io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)

        at java.base/java.lang.Thread.run(Thread.java:829)
Caused by: 
org.apache.cassandra.net.Message$InvalidLegacyProtocolMagic: Read 
-720899, Expected -900387334
        at 
org.apache.cassandra.net.Message.validateLegacyProtocolMagic(Message.java:340)
        at 
org.apache.cassandra.net.HandshakeProtocol$Initiate.maybeDecode(HandshakeProtocol.java:167)
        at 
org.apache.cassandra.net.InboundConnectionInitiator$Handler.initiate(InboundConnectionInitiator.java:242)
        at 
org.apache.cassandra.net.InboundConnectionInitiator$Handler.decode(InboundConnectionInitiator.java:235)
        at 
io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:508)
        at 
io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:447)

        ... 15 common frames omitted

Regards,
Ashish

Re: Impact of column names on storage

2023-06-12 Thread Bowen Song via user
Actually, I was wrong. The column names are not stored in the *-Data.db 
files, but stored in the *-Statistics.db files. Cassandra only stores 
one copy of the column names per SSTable data file, therefore the disk 
space usage is negligible.



On 12/06/2023 14:31, Bowen Song wrote:


The SSTable compression will take care of the storage space usage, 
which means users usually don't need to worry about the length of 
column names, unless they are ridiculously long and hard to compress, 
or if SSTable compression is turned off.



On 12/06/2023 13:55, Dimpal Gurabani wrote:

Hi all,

We have a table with 15 columns and ~1M rows. Looking at the output 
of the sstabledump tool, it seems like column names are stored in the 
cell for each row. Is this understanding correct or just sstabledump 
showing verbose output? If yes, is it recommended to have small 
column names to save on space usage?


--
Thanks and Regards,
Dimpal

Re: Impact of column names on storage

2023-06-12 Thread Bowen Song via user
The SSTable compression will take care of the storage space usage, which 
means users usually don't need to worry about the length of column 
names, unless they are ridiculously long and hard to compress, or if 
SSTable compression is turned off.
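For reference, compression is a per-table option; a minimal sketch of 
enabling and checking it (the keyspace/table name "ks.tbl" is a placeholder):

-- Enable (or change) SSTable compression on a table
ALTER TABLE ks.tbl WITH compression = {'class': 'LZ4Compressor', 'chunk_length_in_kb': 64};

-- Confirm the current setting
SELECT compression FROM system_schema.tables WHERE keyspace_name = 'ks' AND table_name = 'tbl';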



On 12/06/2023 13:55, Dimpal Gurabani wrote:

Hi all,

We have a table with 15 columns and ~1M rows. Looking at the output of 
the sstabledump tool, it seems like column names are stored in the 
cell for each row. Is this understanding correct or just sstabledump 
showing verbose output? If yes, is it recommended to have small column 
names to save on space usage?


--
Thanks and Regards,
Dimpal

Re: Is cleanup is required if cluster topology changes

2023-05-09 Thread Bowen Song via user
Because an operator will need to check and ensure the schema is 
consistent across the cluster before running "nodetool cleanup". At the 
moment, it's the operator's responsibility to ensure bad things don't 
happen.


On 09/05/2023 06:20, Jaydeep Chovatia wrote:

One clarification question Jeff.
AFAIK, the /nodetool cleanup/ also internally goes through the same 
compaction path as the regular compaction. Then why do we have to wait 
for CEP-21 to clean up unowned data in the regular compaction path? 
Wouldn't it be as simple as having regular compaction just invoke the 
code of /nodetool cleanup/?
In other words, without CEP-21, why is /nodetool cleanup/ a safer 
operation but doing the same in the regular compaction isn't?


Jaydeep

On Fri, May 5, 2023 at 11:58 AM Jaydeep Chovatia 
 wrote:


Thanks, Jeff, for the detailed steps and summary.
We will keep the community (this thread) up to date on how it
plays out in our fleet.

Jaydeep

On Fri, May 5, 2023 at 9:10 AM Jeff Jirsa  wrote:

Lots of caveats on these suggestions, let me try to hit most
of them.

Cleanup in parallel is good and fine and common. Limit number
of threads in cleanup if you're using lots of vnodes, so each
node runs one at a time and not all nodes use all your cores
at the same time.
If a host is fully offline, you can ALSO use replace address
first boot. It'll stream data right to that host with the same
token assignments you had before, and no cleanup is needed
then. Strictly speaking, to avoid resurrection here, you'd
want to run repair on the replicas of the down host (for
vnodes, probably the whole cluster), but your current process
doesnt guarantee that either (decom + bootstrap may resurrect,
strictly speaking).
Dropping vnodes will reduce the replicas that have to be
cleaned up, but also potentially increase your imbalance on
each replacement.

Cassandra should still do this on its own, and I think once
CEP-21 is committed, this should be one of the first
enhancement tickets.

Until then, LeveledCompactionStrategy really does make cleanup
fast and cheap, at the cost of higher IO the rest of the time.
If you can tolerate that higher IO, you'll probably appreciate
LCS anyway (faster reads, faster data deletion than STCS).
It's a lot of IO compared to STCS though.


On Fri, May 5, 2023 at 9:02 AM Jaydeep Chovatia
 wrote:

Thanks all for your valuable inputs. We will try some of
the suggested methods in this thread, and see how it goes.
We will keep you updated on our progress.
Thanks a lot once again!

Jaydeep

On Fri, May 5, 2023 at 8:55 AM Bowen Song via user
 wrote:

Depending on the number of vnodes per server, the
probability and severity (i.e. the size of the
affected token ranges) of an availability degradation
due to a server failure during node replacement may be
small. You also have the choice of increasing the RF
if that's still not acceptable.

Also, reducing number of vnodes per server can limit
the number of servers affected by replacing a single
server, therefore reducing the amount of time required
to run "nodetool cleanup" if it is run sequentially.

Finally, you may choose to run "nodetool cleanup"
concurrently on multiple nodes to reduce the amount of
time required to complete it.


On 05/05/2023 16:26, Runtian Liu wrote:

We are doing the "adding a node then decommissioning
a node" to achieve better availability. Replacing a
node needs one node to be shut down first; if another
node goes down during the node replacement period, we
will see an availability drop because most of our use
cases are local_quorum with replication factor 3.

        On Fri, May 5, 2023 at 5:59 AM Bowen Song via user
 wrote:

Have you thought of using
"-Dcassandra.replace_address_first_boot=..." (or
"-Dcassandra.replace_address=..." if you are
using an older version)? This will not result in
a topology change, which means "nodetool cleanup"
is not needed after the operation is completed.

On 05/05/2023 05:24, Jaydeep Chovatia wrote:

Thanks, Jeff!
But in our environment we replace nodes quite
often for various optimization purpose

Re: Is cleanup is required if cluster topology changes

2023-05-05 Thread Bowen Song via user
Depending on the number of vnodes per server, the probability and 
severity (i.e. the size of the affected token ranges) of an availability 
degradation due to a server failure during node replacement may be 
small. You also have the choice of increasing the RF if that's still not 
acceptable.
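For illustration, increasing the RF is a keyspace-level schema change 
followed by a repair; a minimal sketch with placeholder keyspace and 
datacenter names:

-- Placeholder names; adjust the keyspace, strategy and per-DC counts to your cluster
ALTER KEYSPACE my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 5};
-- A full repair of the keyspace is then needed so the new replicas
-- receive the existing data, e.g. nodetool repair -full my_ks on each node.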


Also, reducing number of vnodes per server can limit the number of 
servers affected by replacing a single server, therefore reducing the 
amount of time required to run "nodetool cleanup" if it is run sequentially.


Finally, you may choose to run "nodetool cleanup" concurrently on 
multiple nodes to reduce the amount of time required to complete it.



On 05/05/2023 16:26, Runtian Liu wrote:
We are doing the "adding a node then decommissioning a node" to 
achieve better availability. Replacing a node needs one node to be shut 
down first; if another node goes down during the node replacement 
period, we will see an availability drop because most of our use cases 
are local_quorum with replication factor 3.


On Fri, May 5, 2023 at 5:59 AM Bowen Song via user 
 wrote:


Have you thought of using
"-Dcassandra.replace_address_first_boot=..." (or
"-Dcassandra.replace_address=..." if you are using an older
version)? This will not result in a topology change, which means
"nodetool cleanup" is not needed after the operation is completed.

On 05/05/2023 05:24, Jaydeep Chovatia wrote:

Thanks, Jeff!
But in our environment we replace nodes quite often for various
optimization purposes, etc. say, almost 1 node per day (node
/addition/ followed by node /decommission/, which of course
changes the topology), and we have a cluster of size 100 nodes
with 300GB per node. If we have to run cleanup on 100 nodes after
every replacement, then it could take forever.
What is the recommendation until we get this fixed in Cassandra
itself as part of compaction (w/o externally triggering /cleanup/)?

Jaydeep

On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa  wrote:

Cleanup is fast and cheap and basically a no-op if you
haven’t changed the ring

After cassandra has transactional cluster metadata to make
ring changes strongly consistent, cassandra should do this in
every compaction. But until then it’s left for operators to
run when they’re sure the state of the ring is correct .




On May 4, 2023, at 7:41 PM, Jaydeep Chovatia
 wrote:


Isn't this considered a kind of *bug* in Cassandra because
as we know /cleanup/ is a lengthy and unreliable operation,
so relying on the /cleanup/ means higher chances of data
resurrection?
Do you think we should discard the unowned token-ranges as
part of the regular compaction itself? What are the pitfalls
of doing this as part of compaction itself?

Jaydeep

On Thu, May 4, 2023 at 7:25 PM guo Maxwell
 wrote:

Compaction will just merge duplicate data and remove
deleted data on this node. If you add or remove one node
from the cluster, I think cleanup is needed. If cleanup
failed, I think we should look into the reason.

Runtian Liu wrote on Friday, 5 May 2023 at 06:37:

Hi all,

Is cleanup the sole method to remove data that does
not belong to a specific node? In a cluster, where
nodes are added or decommissioned from time to time,
failure to run cleanup may lead to data resurrection
issues, as deleted data may remain on the node that
lost ownership of certain partitions. Or is it true
that normal compactions can also handle data removal
for nodes that no longer have ownership of certain data?

Thanks,
Runtian



-- 
you are the apple of my eye !


Re: Is cleanup is required if cluster topology changes

2023-05-05 Thread Bowen Song via user
Have you thought of using "-Dcassandra.replace_address_first_boot=..." 
(or "-Dcassandra.replace_address=..." if you are using an older 
version)? This will not result in a topology change, which means 
"nodetool cleanup" is not needed after the operation is completed.


On 05/05/2023 05:24, Jaydeep Chovatia wrote:

Thanks, Jeff!
But in our environment we replace nodes quite often for various 
optimization purposes, etc. say, almost 1 node per day (node 
/addition/ followed by node /decommission/, which of course changes 
the topology), and we have a cluster of size 100 nodes with 300GB per 
node. If we have to run cleanup on 100 nodes after every replacement, 
then it could take forever.
What is the recommendation until we get this fixed in Cassandra itself 
as part of compaction (w/o externally triggering /cleanup/)?


Jaydeep

On Thu, May 4, 2023 at 8:14 PM Jeff Jirsa  wrote:

Cleanup is fast and cheap and basically a no-op if you haven’t
changed the ring

After cassandra has transactional cluster metadata to make ring
changes strongly consistent, cassandra should do this in every
compaction. But until then it’s left for operators to run when
they’re sure the state of the ring is correct .




On May 4, 2023, at 7:41 PM, Jaydeep Chovatia
 wrote:


Isn't this considered a kind of *bug* in Cassandra because as we
know /cleanup/ is a lengthy and unreliable operation, so relying
on the /cleanup/ means higher chances of data resurrection?
Do you think we should discard the unowned token-ranges as part
of the regular compaction itself? What are the pitfalls of doing
this as part of compaction itself?

Jaydeep

On Thu, May 4, 2023 at 7:25 PM guo Maxwell 
wrote:

Compaction will just merge duplicate data and remove deleted
data on this node. If you add or remove one node from the
cluster, I think cleanup is needed. If cleanup failed, I
think we should look into the reason.

Runtian Liu wrote on Friday, 5 May 2023 at 06:37:

Hi all,

Is cleanup the sole method to remove data that does not
belong to a specific node? In a cluster, where nodes are
added or decommissioned from time to time, failure to run
cleanup may lead to data resurrection issues, as deleted
data may remain on the node that lost ownership of
certain partitions. Or is it true that normal compactions
can also handle data removal for nodes that no longer
have ownership of certain data?

Thanks,
Runtian



-- 
you are the apple of my eye !


Re: Optimization for partitions with high number of rows

2023-04-16 Thread Bowen Song via user
Using a frozen UDT for all the non-key columns is a good starting point. 
You can go a step further and use frozen UDTs for the partition keys and 
clustering keys too if appropriate. This alone will dramatically reduce 
the number of cells per row from 13 to 3, and save 77% of 
deserialisation work for Cassandra.
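For illustration, a minimal sketch of the frozen-UDT idea for the non-key 
columns, reusing the column names from the table definition quoted further 
down in this thread (the type and table names are made up):

-- Illustrative only; type/table names are made up
CREATE TYPE ks1.item_meta (
    metadata text,
    m_metadata_created_at timestamp,
    m_metadata_created_by bigint,
    m_metadata_updated_at timestamp
);

CREATE TABLE ks1.item_to_item_v2 (
    x1 bigint, x2 bigint, x3 int, x4 int,
    y1 bigint, y2 bigint, y3 bigint, y4 bigint, y5 bigint,
    meta frozen<item_meta>,
    PRIMARY KEY ((x1, x2, x3, x4), y1, y2, y3, y4, y5)
) WITH CLUSTERING ORDER BY (y1 ASC, y2 ASC, y3 ASC, y4 ASC, y5 ASC);

This collapses the four regular columns into a single frozen cell, so each 
row carries one value cell instead of four.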


If the performance is still suboptimal after you've done the above, you 
should consider creating a batch process that reads the smaller rows 
from this table and combining them into bigger rows, and then storing 
the new row in another table which has the same partition key but each 
row is a frozen list that contains many original rows. If you combine 
all rows from each partition of the old table into a single row in the 
new table, the read speed should be much faster. Keep in mind that this 
may not work if the partition size of the original table is too large 
(approximately >16MB), as the mutation size is limited to up to half of 
the commitlog segment size.
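A sketch of that second idea, again with made-up names, packing the rows 
of a partition into a single frozen collection (reusing the item_meta type 
from the sketch above):

-- Illustrative only; assumes one item_row value per original row
CREATE TYPE ks1.item_row (
    y1 bigint, y2 bigint, y3 bigint, y4 bigint, y5 bigint,
    meta frozen<item_meta>
);

CREATE TABLE ks1.item_to_item_packed (
    x1 bigint, x2 bigint, x3 int, x4 int,
    rows frozen<list<frozen<item_row>>>,
    PRIMARY KEY ((x1, x2, x3, x4))
);

Reading one partition then becomes a single-cell read, at the cost of 
rewriting the whole blob whenever the batch process refreshes it.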



On 12/04/2023 06:14, Gil Ganz wrote:

Is there something I can do to speed up the deserialisation ?
In this example I did a count query, but in reality I need the actual 
data.
Write pattern in this table is such that all data for a given row is 
written at the same time, so I know I can use frozen udt instead of 
this, making it faster, but I wonder if there is another way.


On Tue, Apr 11, 2023 at 9:06 PM Bowen Song via user 
 wrote:


Reading 4MB from 70k rows and 13 columns (0.91 million cells) from
disk in 120ms doesn't sound bad. That's a lot of deserialisation
to do. If you want it to be faster, you can store the number of
rows elsewhere if that's the only thing you need.

On 11/04/2023 07:13, Gil Ganz wrote:

Hey
I have a 4.0.4 cluster, with reads of partitions that are a bit
on the bigger side, taking longer than I would expect.
Reading entire partition that has ~7 rows, total partition
size of 4mb, takes 120ms, I would expect it to take less.

This is after major compaction, so there is only one sstables.
local_one consistency level, no tombstones, and reading the
entire partition in one fetch.
Cluster is not doing much else at the time, nvme disk. I can see
most of the time is spent on getting the data from the sstable.

Is there any specific optimization one can do to speed up cases
like this?
I would expect fetching 4mb to take less, I assume if this was
one blob of 4mb that would be the case.

Table definition :

CREATE TABLE ks1.item_to_item (
    x1 bigint,
    x2 bigint,
    x3 int,
    x4 int,
    y1 bigint,
    y2 bigint,
    y3 bigint,
    y4 bigint,
    y5 bigint,
    metadata text,
    m_metadata_created_at timestamp,
    m_metadata_created_by bigint,
    m_metadata_updated_at timestamp,
    PRIMARY KEY ((x1, x2, x3, x4), y1, y2, y3, y4, y5)
) WITH CLUSTERING ORDER BY (y1 ASC, y2 ASC, y3 ASC, y4 ASC, y5 ASC)



cqlsh> select count(0) from  ks1.item_to_item where x1=4 and
x2=7 and x4=0 and x3=1;

 count
---
 7

(1 rows)

Tracing session: 6356d290-d785-11ed-aba5-ab86979f2f58

 activity                                                                                 | timestamp                  | source     | source_elapsed | client
------------------------------------------------------------------------------------------+----------------------------+------------+----------------+-----------
 Execute CQL3 query                                                                       | 2023-04-10 09:52:21.561000 | 172.25.0.4 |              0 | 127.0.0.1
 Parsing [Native-Transport-Requests-1]                                                    | 2023-04-10 09:52:21.561000 | 172.25.0.4 |            428 | 127.0.0.1
 Preparing statement [Native-Transport-Requests-1]                                        | 2023-04-10 09:52:21.562000 | 172.25.0.4 |            973 | 127.0.0.1
 Acquiring sstable references [ReadStage-2]                                               | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2255 | 127.0.0.1
 Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2]  | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2524 | 127.0.0.1
 Key cache hit for sstable 9 [ReadStage-2]                                                | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2692 | 127.0.0.1
 Merged data from memtables and 1 sstables [ReadStage-2]                                  | 2023-04-10 09:52:21.651000 | 172.25.0.4 |          90405 | 127.0.0.1
 Read 7 live rows and 0 tombstone cells [ReadStage-2]                                     | 2023-04-10 09:52:21.651000 | 172.25.0.4 |          90726 | 127.0.0.1
 Request complete                                                                         | 2023-04-10 09:52:21.682603 | 172.25.0.4 |         121603 | 127.0.0.1

Re: Optimization for partitions with high number of rows

2023-04-11 Thread Bowen Song via user
Reading 4MB from 70k rows and 13 columns (0.91 million cells) from disk 
in 120ms doesn't sound bad. That's a lot of deserialisation to do. If 
you want it to be faster, you can store the number of rows elsewhere if 
that's the only thing you need.
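For illustration, a minimal sketch of storing the count elsewhere, using a 
counter table with made-up names:

-- Illustrative only; table and column names are made up
CREATE TABLE ks1.item_to_item_counts (
    x1 bigint, x2 bigint, x3 int, x4 int,
    row_count counter,
    PRIMARY KEY ((x1, x2, x3, x4))
);

-- Maintained by the application alongside every insert/delete
UPDATE ks1.item_to_item_counts SET row_count = row_count + 1
WHERE x1 = 4 AND x2 = 7 AND x3 = 1 AND x4 = 0;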


On 11/04/2023 07:13, Gil Ganz wrote:

Hey
I have a 4.0.4 cluster, with reads of partitions that are a bit on the 
bigger side, taking longer than I would expect.
Reading entire partition that has ~7 rows, total partition size of 
4mb, takes 120ms, I would expect it to take less.


This is after major compaction, so there is only one sstables. 
local_one consistency level, no tombstones,  and reading the entire 
partition in one fetch.
Cluster is not doing much else at the time, nvme disk. I can see most 
of the time is spent on getting the data from the sstable.


Is there any specific optimization one can do to speed up cases like this?
I would expect fetching 4mb to take less, I assume if this was one 
blob of 4mb that would be the case.


Table definition :

CREATE TABLE ks1.item_to_item (
    x1 bigint,
    x2 bigint,
    x3 int,
    x4 int,
    y1 bigint,
    y2 bigint,
    y3 bigint,
    y4 bigint,
    y5 bigint,
    metadata text,
    m_metadata_created_at timestamp,
    m_metadata_created_by bigint,
    m_metadata_updated_at timestamp,
    PRIMARY KEY ((x1, x2, x3, x4), y1, y2, y3, y4, y5)
) WITH CLUSTERING ORDER BY (y1 ASC, y2 ASC, y3 ASC, y4 ASC, y5 ASC)



cqlsh> select count(0) from  ks1.item_to_item where x1=4 and x2=7 
and x4=0 and x3=1;


 count
---
 7

(1 rows)

Tracing session: 6356d290-d785-11ed-aba5-ab86979f2f58

 activity                                                                                 | timestamp                  | source     | source_elapsed | client
------------------------------------------------------------------------------------------+----------------------------+------------+----------------+-----------
 Execute CQL3 query                                                                       | 2023-04-10 09:52:21.561000 | 172.25.0.4 |              0 | 127.0.0.1
 Parsing [Native-Transport-Requests-1]                                                    | 2023-04-10 09:52:21.561000 | 172.25.0.4 |            428 | 127.0.0.1
 Preparing statement [Native-Transport-Requests-1]                                        | 2023-04-10 09:52:21.562000 | 172.25.0.4 |            973 | 127.0.0.1
 Acquiring sstable references [ReadStage-2]                                               | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2255 | 127.0.0.1
 Skipped 0/1 non-slice-intersecting sstables, included 0 due to tombstones [ReadStage-2]  | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2524 | 127.0.0.1
 Key cache hit for sstable 9 [ReadStage-2]                                                | 2023-04-10 09:52:21.563000 | 172.25.0.4 |           2692 | 127.0.0.1
 Merged data from memtables and 1 sstables [ReadStage-2]                                  | 2023-04-10 09:52:21.651000 | 172.25.0.4 |          90405 | 127.0.0.1
 Read 7 live rows and 0 tombstone cells [ReadStage-2]                                     | 2023-04-10 09:52:21.651000 | 172.25.0.4 |          90726 | 127.0.0.1
 Request complete                                                                         | 2023-04-10 09:52:21.682603 | 172.25.0.4 |         121603 | 127.0.0.1



gil

Re: CAS operation result is unknown - proposal accepted by 1 but not a quorum

2023-04-11 Thread Bowen Song via user
That error message sounds like one of the nodes timed out in the paxos 
propose stage.  You can check the system.log and gc.log and see if you 
can find anything unusual in them, such as network errors, out of sync 
clocks or long stop-the-world GC pauses.



BTW, since you said you want it to be fast, I think it's worth 
mentioning that LWT comes with additional cost and is much slower than a 
straightforward INSERT/UPDATE. You should avoid using it if possible. 
For example, if all of the Cassandra clients (samba servers) are running 
on the same machine, it may be far more efficient to use a lock than LWT.



On 11/04/2023 18:18, Ralph Boehme wrote:

Hi folks!

Ralph here from the Samba team.

I'm currently doing research into Opensource distributed NoSQL 
key/value stores to be used by Samba as a more scalable alternative 
to Samba's own homegrown distributed key/value store called "ctdb" [1].


As an Opensource implementation of the SMB filesharing protocol from 
Microsoft, we have some specific requirements wrt database behaviour:


- fast
- fast
- fast
- highly consistent, iow linearizable

We got away without a linearizable database as historically the SMB 
protocol and the SMB client implementations were built around the 
assumption that handle and session state at the server could be lost 
due to events like process or server crashes, and clients would 
implement a best-effort strategy to recover client state.


Modern SMB3 offers stronger guarantees which require a strongly 
consistent ie linearizable database.


While prototyping a Python module for our pluggable database client in 
Samba I ran into the following issue with Cassandra:


  File "cassandra/cluster.py", line 2618, in 
cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 4901, in 
cassandra.cluster.ResponseFuture.result
cassandra.protocol.ErrorMessageSub: [Unknown] message="CAS operation result is unknown - proposal accepted 
by 1 but not a quorum.">


This happens when executing the following LWT:

    f'''
    INSERT INTO {dbname} (key, guid, owner, refcount)
    VALUES (?, ?, ?, ?)
    IF NOT EXISTS
    ''')

This is the first time I'm running Cassandra. I've just set up a three 
node test cluster and everything looks ok:


# nodetool status
Datacenter: datacenter1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load        Tokens  Owns (effective)  Host ID                               Rack
UN  172.18.200.21  360,09 KiB  16      100,0%            4590f3a6-4ca5-466f-a24d-edc54afa36f0  rack1
UN  172.18.200.23  326,92 KiB  16      100,0%            9175fd4e-4d84-4899-878a-dd5266132ff8  rack1
UN  172.18.200.22  335,32 KiB  16      100,0%            35e05369-cc8a-4642-b98d-a5fcc326502f  rack1


Can anyone shed some light on what I might be doing wrong?

Thanks!
-slow

[1] 



Re: Issues during Install/Remove Cassandra ver 4.0.x

2023-04-05 Thread Bowen Song via user
Since you have already downloaded the RPM file, you may install it with 
"yum install cassandra-4.0.7-1.noarch.rpm" command. This will install 
the package with all of its dependencies.


BTW, you can even run "yum install 
https://redhat.cassandra.apache.org/40x/cassandra-4.0.7-1.noarch.rpm" to 
download and install the package with just one command.



On 05/04/2023 14:45, MyWorld wrote:

Hi all,
We are facing one issue in installing cassandra-4.0.7.

### We started with *yum installation*. We set up the repo "cassandra.repo" 
as below:

[cassandra]
name=Apache Cassandra
baseurl=https://redhat.cassandra.apache.org/40x/noboolean/
gpgcheck=1
repo_gpgcheck=1
gpgkey=https://downloads.apache.org/cassandra/KEYS

On doing "yum list cassandra", *it shows ver 4.0.8 but not 4.0.7*.
Further using showduplicates "yum --showduplicates list cassandra", 
*still it shows ver 4.0.8 but not 4.0.7*.


*How can we get earlier versions here ??*

###Next, we tried *using rpm,*
sudo curl -OL 
https://redhat.cassandra.apache.org/40x/cassandra-4.0.7-1.noarch.rpm


On running "sudo rpm -ivh cassandra-4.0.7-1.noarch.rpm",
It gives the below error:
error: Failed dependencies:
        (jre-1.8.0 or jre-11) is needed by cassandra-4.0.7-1.noarch
        rpmlib(RichDependencies) <= 4.12.0-1 is needed by cassandra-4.0.7-1.noarch
Then, I solved this by using "sudo rpm --nodeps -ivh 
cassandra-4.0.7-1.noarch.rpm"

and the version was installed successfully.

*Is skipping dependencies with  --nodeps a right approach ??*

###Next, I tried to *uninstall the version* using
"yum remove cassandra"
It gives error: *Invalid version flag: or*

Refer complete trace below:
# yum remove cassandra
Loaded plugins: fastestmirror
Resolving Dependencies
--> Running transaction check
---> Package cassandra.noarch 0:4.0.7-1 will be erased
--> Finished Dependency Resolution

Dependencies Resolved
===
 Package                     Arch                     Version         
           Repository                   Size

===
Removing:
 cassandra                   noarch                   4.0.7-1         
           installed                    55 M


Transaction Summary
===
Remove  1 Package

Installed size: 55 M
Is this ok [y/N]: y
Downloading packages:
Running transaction check
Running transaction test
Transaction test succeeded
Running transaction

Invalid version flag: or

*How to solve this issue ??*

Regards,
Ashish Gupta

Re: Reads not returning data after adding node

2023-04-05 Thread Bowen Song via user
It is not necessary, but recommended to run repair before adding nodes. 
That's because deleted data may be resurrected if the time between two 
repair runs is longer than the gc_grace_period, and adding nodes can 
take a lot of time.


Running nodetool cleanup is also not required, but recommended. Without 
this, the disk space on existing nodes will not be freed up. If you are 
adding multiple new nodes, and aren't facing immediate free disk space 
crisis, it would make more sense to run cleanup once after *all* new 
nodes are added than run it once after *each* new node is added.



On 05/04/2023 05:24, David Tinker wrote:
The Datastax doc says to run cleanup one node at a time after 
bootstrapping has completed. The myadventuresincoding post says to run 
a repair on each node first. Is it necessary to run the repairs first? 
Thanks.


On Tue, Apr 4, 2023 at 1:11 PM Bowen Song via user 
 wrote:


Perhaps have a read here?

https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsAddNodeToCluster.html


On 04/04/2023 06:41, David Tinker wrote:

Ok. Have to psych myself up to the add node task a bit. Didn't go
well the first time round!

Tasks
- Make sure the new node is not in seeds list!
- Check cluster name, listen address, rpc address
- Give it its own rack in cassandra-rackdc.properties
- Delete cassandra-topology.properties if it exists
- Make sure no compactions are on the go
- rm -rf /var/lib/cassandra/*
- rm /data/cassandra/commitlog/* (this is on different disk)
- systemctl start cassandra

And it should start streaming data from the other nodes and join
the cluster. Anything else I have to watch out for? Tx.


On Tue, Apr 4, 2023 at 5:25 AM Jeff Jirsa  wrote:

Because executing “removenode” streamed extra data from live
nodes to the “gaining” replica

Oversimplified (if you had one token per node)

If you  start with A B C

Then add D

D should bootstrap a range from each of A B and C, but at the
end, some of the data that was A B C becomes B C D

When you removenode, you tell B and C to send data back to A.

A B and C will eventually compact that data away. Eventually.

If you get around to adding D again, running “cleanup” when
you’re done (successfully) will remove a lot of it.




On Apr 3, 2023, at 8:14 PM, David Tinker
 wrote:


Looks like the remove has sorted things out. Thanks.

One thing I am wondering about is why the nodes are
carrying a lot more data? The loads were about 2.7T
before, now 3.4T.

# nodetool status
Datacenter: dc1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
UN  xxx.xxx.xxx.105  3.4 TiB   256     100.0%            afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
UN  xxx.xxx.xxx.253  3.34 TiB  256     100.0%            e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
UN  xxx.xxx.xxx.107  3.44 TiB  256     100.0%            ab72f017-be96-41d2-9bef-a551dec2c7b5  rack1

On Mon, Apr 3, 2023 at 5:42 PM Bowen Song via user
 wrote:

That's correct. nodetool removenode is strongly
preferred when your node is already down. If the node is
still functional, use nodetool decommission on the node
instead.

On 03/04/2023 16:32, Jeff Jirsa wrote:

FWIW, `nodetool decommission` is strongly preferred.
`nodetool removenode` is designed to be run when a host
is offline. Only decommission is guaranteed to maintain
consistency / correctness, and removenode probably
streams a lot more data around than decommission.


On Mon, Apr 3, 2023 at 6:47 AM Bowen Song via user
 wrote:

Using nodetool removenode is strongly preferred in
most circumstances, and only resort to assassinate
if you do not care about data consistency or you
know there won't be any consistency issue (e.g. no
new writes and did not run nodetool cleanup).

Since the size of data on the new node is small,
nodetool removenode should finish fairly quickly
and bring your cluster back.

Next time when you are doing something like this
again, please test it out on a non-production
environment, make sure everything works as expected
before moving onto the production.


On 03/04/2023 06:28, David Tinker wrote:

Should I use assassinate or removenode? Given that
there is some data on the node. Or will that be
found on the other nodes? Sorry

Re: When are sstables that were compacted deleted?

2023-04-05 Thread Bowen Song via user
It may be useful to attach the output from the nodetool tpstats, 
nodetool compactionstats and nodetool netstats commands.


If any SSTable involved in a transaction is being compacted, repaired or 
streamed, etc., the transaction clean up will be delayed. This is the 
expected behaviour.



On 05/04/2023 04:50, Gil Ganz wrote:
In this case the removal process had already finished hours before, so 
nothing is streaming anymore (although, looking at the list of the 
transaction logs left behind, it could be that some of the transactions 
finished while the decommission was still running).

We don't have secondary indexes, and no repairs were running.
I will try to reproduce it. Are there any debug flags you think would 
help next time?


On Wed, Apr 5, 2023 at 1:03 AM Jeff Jirsa  wrote:

You will DEFINITELY not remove sstables obsoleted by compaction if
they are being streamed out to neighbors. It would also not
surprise me that if you have something holding a background
reference to one of the sstables in the oldest/older compaction
transaction logs, that the whole process may block waiting on the
tidier to clean that up.

Things that may hold references:
- Validation compactions (repair)
- Index build/rebuild
- Streaming (repair, bootstrap, move, decom)

If you have repairs running, you can try pausing/cancelling them
and/or stopping validation/index_build compactions.



On Tue, Apr 4, 2023 at 2:29 PM Gil Ganz  wrote:

If it was just one instance I would just bounce it but the
thing is this happens when we join/remove nodes, and we have a
lot of servers with this issue (while before the join/remove
we are on ~50% disk usage).
We found ourselves fighting with compaction to make sure we
don't run out of space.
Will open a ticket, thanks.


On Wed, Apr 5, 2023 at 12:10 AM Jeff Jirsa 
wrote:

If you restart the node, it'll process/purge those
compaction logs on startup, but you want them to
purge/process now.

I genuinely don't know when the tidier runs, but it's
likely the case that you're too busy compacting to purge
(though I don't know what exactly "too busy" means).

Since you're close to 95% disk full, bounce one instance
at a time to recover the space, but we probably need a
JIRA to understand exactly what's blocking the tidier from
running.



On Tue, Apr 4, 2023 at 1:55 PM Gil Ganz
 wrote:

More information - from another node in the cluster

I can see many txn files although I only have two
compactions running.
[user@server808
new_table-44263b406bf111ed8bd9df80ace3677a]# ls -l *txn*
-rw-r--r-- 1 cassandra cassandra 613 Apr  4 05:26
nb_txn_compaction_09e3aa40-d2a7-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 461 Apr  4 10:17
nb_txn_compaction_11433360-d265-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 463 Apr  4 09:48
nb_txn_compaction_593e5320-d265-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 614 Apr  3 22:47
nb_txn_compaction_701d62d0-d264-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 136 Apr  3 22:27
nb_txn_compaction_bb770b50-d26e-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 463 Apr  4 09:23
nb_txn_compaction_ce51bfe0-d264-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 134 Apr  4 10:31
nb_txn_compaction_d17c7380-d2d3-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 464 Apr  4 09:24
nb_txn_compaction_ed7fc650-d264-11ed-b76b-3b279f6334bc.log
-rw-r--r-- 1 cassandra cassandra 613 Apr  3 22:54
nb_txn_compaction_f456f3b0-d271-11ed-b76b-3b279f6334bc.log

Let's take for example the one from "Apr  4 09:24"
I can see the matching log message in system.log

INFO  [CompactionExecutor:142085] 2023-04-04
09:24:29,892 CompactionTask.java:241 - Compacted
(ed7fc650-d264-11ed-b76b-3b279f6334bc) 2 sstables to

[/var/lib/cassandra/data/disk1/kt_ks/new_table-44263b406bf111ed8bd9df80ace3677a/nb-12142-big,]
to level=0.  362.987GiB to 336.323GiB (~92% of
original) in 43,625,742ms.  Read Throughput =
8.520MiB/s, Write Throughput = 7.894MiB/s, Row
Throughput = ~-11,482/s.  3,755,353,838 total
partitions merged to 3,479,484,261.  Partition merge
counts were {1:3203614684, 2:275869577, }


[user@server808

Re: Reads not returning data after adding node

2023-04-04 Thread Bowen Song via user
Perhaps have a read here? 
https://docs.datastax.com/en/cassandra-oss/3.x/cassandra/operations/opsAddNodeToCluster.html



On 04/04/2023 06:41, David Tinker wrote:
Ok. Have to psych myself up to the add node task a bit. Didn't go well 
the first time round!


Tasks
- Make sure the new node is not in seeds list!
- Check cluster name, listen address, rpc address
- Give it its own rack in cassandra-rackdc.properties
- Delete cassandra-topology.properties if it exists
- Make sure no compactions are on the go
- rm -rf /var/lib/cassandra/*
- rm /data/cassandra/commitlog/* (this is on different disk)
- systemctl start cassandra

And it should start streaming data from the other nodes and join the 
cluster. Anything else I have to watch out for? Tx.



On Tue, Apr 4, 2023 at 5:25 AM Jeff Jirsa  wrote:

Because executing “removenode” streamed extra data from live nodes
to the “gaining” replica

Oversimplified (if you had one token per node)

If you  start with A B C

Then add D

D should bootstrap a range from each of A B and C, but at the end,
some of the data that was A B C becomes B C D

When you removenode, you tell B and C to send data back to A.

A B and C will eventually compact that data away. Eventually.

If you get around to adding D again, running “cleanup” when you’re
done (successfully) will remove a lot of it.




On Apr 3, 2023, at 8:14 PM, David Tinker 
wrote:


Looks like the remove has sorted things out. Thanks.

One thing I am wondering about is why the nodes are carrying a
lot more data? The loads were about 2.7T before, now 3.4T.

# nodetool status
Datacenter: dc1
===
Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address          Load      Tokens  Owns (effective)  Host ID                               Rack
UN  xxx.xxx.xxx.105  3.4 TiB   256     100.0%            afd02287-3f88-4c6f-8b27-06f7a8192402  rack3
UN  xxx.xxx.xxx.253  3.34 TiB  256     100.0%            e1af72be-e5df-4c6b-a124-c7bc48c6602a  rack2
UN  xxx.xxx.xxx.107  3.44 TiB  256     100.0%            ab72f017-be96-41d2-9bef-a551dec2c7b5  rack1

On Mon, Apr 3, 2023 at 5:42 PM Bowen Song via user
 wrote:

That's correct. nodetool removenode is strongly preferred
when your node is already down. If the node is still
functional, use nodetool decommission on the node instead.

On 03/04/2023 16:32, Jeff Jirsa wrote:

FWIW, `nodetool decommission` is strongly preferred.
`nodetool removenode` is designed to be run when a host is
offline. Only decommission is guaranteed to maintain
consistency / correctness, and removenode probably streams a
lot more data around than decommission.


On Mon, Apr 3, 2023 at 6:47 AM Bowen Song via user
 wrote:

Using nodetool removenode is strongly preferred in most
circumstances, and only resort to assassinate if you do
not care about data consistency or you know there won't
be any consistency issue (e.g. no new writes and did not
run nodetool cleanup).

Since the size of data on the new node is small,
nodetool removenode should finish fairly quickly and
bring your cluster back.

Next time when you are doing something like this again,
please test it out on a non-production environment, make
sure everything works as expected before moving onto the
production.


On 03/04/2023 06:28, David Tinker wrote:

Should I use assassinate or removenode? Given that
there is some data on the node. Or will that be found
on the other nodes? Sorry for all the questions but I
really don't want to mess up.

On Mon, Apr 3, 2023 at 7:21 AM Carlos Diaz
 wrote:

That's what nodetool assassinte will do.

On Sun, Apr 2, 2023 at 10:19 PM David Tinker
 wrote:

Is it possible for me to remove the node from
the cluster i.e. to undo this mess and get the
cluster operating again?

On Mon, Apr 3, 2023 at 7:13 AM Carlos Diaz
 wrote:

You can leave it in the seed list of the
other nodes, just make sure it's not
included in this node's seed list. 
However, if you do decide to fix the issue
with the racks first assassinate this node
(nodetool assassinate ), and update the
rack name before you restart.

On Sun, Apr 2, 2023 at 10:06 PM David
Tinker  wrote:

It is also in the seeds list for the
other nodes. Should I remove it from

Re: Reads not returning data after adding node

2023-04-03 Thread Bowen Song via user
That's correct. nodetool removenode is strongly preferred when your node 
is already down. If the node is still functional, use nodetool 
decommission on the node instead.


On 03/04/2023 16:32, Jeff Jirsa wrote:
FWIW, `nodetool decommission` is strongly preferred. `nodetool 
removenode` is designed to be run when a host is offline. Only 
decommission is guaranteed to maintain consistency / correctness, and 
removenode probably streams a lot more data around than decommission.



On Mon, Apr 3, 2023 at 6:47 AM Bowen Song via user 
 wrote:


Using nodetool removenode is strongly preferred in most
circumstances, and only resort to assassinate if you do not care
about data consistency or you know there won't be any consistency
issue (e.g. no new writes and did not run nodetool cleanup).

Since the size of data on the new node is small, nodetool
removenode should finish fairly quickly and bring your cluster back.

Next time when you are doing something like this again, please
test it out on a non-production environment, make sure everything
works as expected before moving onto the production.


On 03/04/2023 06:28, David Tinker wrote:

Should I use assassinate or removenode? Given that there is some
data on the node. Or will that be found on the other nodes? Sorry
for all the questions but I really don't want to mess up.

On Mon, Apr 3, 2023 at 7:21 AM Carlos Diaz 
wrote:

That's what nodetool assassinte will do.

On Sun, Apr 2, 2023 at 10:19 PM David Tinker
 wrote:

Is it possible for me to remove the node from the cluster
i.e. to undo this mess and get the cluster operating again?

On Mon, Apr 3, 2023 at 7:13 AM Carlos Diaz
 wrote:

You can leave it in the seed list of the other nodes,
just make sure it's not included in this node's seed
list.  However, if you do decide to fix the issue
with the racks first assassinate this node (nodetool
assassinate ), and update the rack name before
you restart.

On Sun, Apr 2, 2023 at 10:06 PM David Tinker
 wrote:

It is also in the seeds list for the other nodes.
Should I remove it from those, restart them one
at a time, then restart it?

/etc/cassandra # grep -i bootstrap *
doesn't show anything so I don't think I have
auto_bootstrap false.

Thanks very much for the help.


On Mon, Apr 3, 2023 at 7:01 AM Carlos Diaz
 wrote:

Just remove it from the seed list in the
cassandra.yaml file and restart the node. 
Make sure that auto_bootstrap is set to true
first though.

On Sun, Apr 2, 2023 at 9:59 PM David Tinker
 wrote:

So likely because I made it a seed node
when I added it to the cluster it didn't
do the bootstrap process. How can I
recover this?

On Mon, Apr 3, 2023 at 6:41 AM David
Tinker  wrote:

Yes replication factor is 3.

I ran nodetool repair -pr on all the
nodes (one at a time) and am still
having issues getting data back from
queries.

I did make the new node a seed node.

Re "rack4": I assumed that was just
an indication as to the physical
location of the server for
redundancy. This one is separate from
the others so I used rack4.

On Mon, Apr 3, 2023 at 6:30 AM Carlos
Diaz  wrote:

I'm assuming that your
replication factor is 3.  If
that's the case, did you
intentionally put this node in
rack 4?  Typically, you want to
add nodes in multiples of your
replication factor in order to
keep the "racks" balanced.  In
other words, this node should
have been added to rack 1, 2 or 3.

Having s

Re: Understanding rack in cassandra-rackdc.properties

2023-04-03 Thread Bowen Song via user
I just want to mention that the "rack" in Cassandra doesn't need to match 
the physical rack. As long as each "rack" in Cassandra fails independently 
of the others, it is fine.


That means if you have 6 physical servers, each in a unique physical 
rack and Cassandra RF=3, you can have any of the following 
configurations, and each of them makes sense and all of them will work 
correctly:


1. 6 racks in Cassandra, each contains only 1 server

2. 3 racks in Cassandra, each contains 2 servers

3. 1 rack in Cassandra, with all 6 servers in it



On 03/04/2023 16:14, Jeff Jirsa wrote:
As long as the number of racks is already at/above the number of nodes 
/ replication factor, it's gonna be fine.


Where it tends to surprise people is if you have RF=3 and either 1 or 
2 racks, and then you add a third, that third rack gets one copy of 
"all" of the data, so you often run out of disk space.


If you're already at 3 nodes / 3 racks / RF=3, you're already evenly 
distributed, the next (4th, 5th, 6th) racks will just be randomly 
assigned based on the random token allocation.




On Mon, Apr 3, 2023 at 8:12 AM David Tinker  
wrote:


I have a 3 node cluster using the GossipingPropertyFileSnitch and
replication factor of 3. All nodes are leased hardware and more or
less the same. The cassandra-rackdc.properties files look like this:

dc=dc1
rack=rack1
(rack2 and rack3 for the other nodes)

Now I need to expand the cluster. I was going to use rack4 for the
next node, then rack5 and rack6 because the nodes are physically
all on different racks. Elsewhere on this list someone mentioned
that I should use rack1, rack2 and rack3 again.

Why is that?

Thanks
David


Re: Reads not returning data after adding node

2023-04-03 Thread Bowen Song via user
Using nodetool removenode is strongly preferred in most circumstances, and 
only resort to assassinate if you do not care about data consistency or 
you know there won't be any consistency issue (e.g. no new writes and 
did not run nodetool cleanup).


Since the size of data on the new node is small, nodetool removenode 
should finish fairly quickly and bring your cluster back.


Next time when you are doing something like this again, please test it 
out on a non-production environment, make sure everything works as 
expected before moving onto the production.



On 03/04/2023 06:28, David Tinker wrote:
Should I use assassinate or removenode? Given that there is some data 
on the node. Or will that be found on the other nodes? Sorry for all 
the questions but I really don't want to mess up.


On Mon, Apr 3, 2023 at 7:21 AM Carlos Diaz  wrote:

That's what nodetool assassinte will do.

On Sun, Apr 2, 2023 at 10:19 PM David Tinker
 wrote:

Is it possible for me to remove the node from the cluster i.e.
to undo this mess and get the cluster operating again?

On Mon, Apr 3, 2023 at 7:13 AM Carlos Diaz
 wrote:

You can leave it in the seed list of the other nodes, just
make sure it's not included in this node's seed list. 
However, if you do decide to fix the issue with the racks
first assassinate this node (nodetool assassinate ),
and update the rack name before you restart.

On Sun, Apr 2, 2023 at 10:06 PM David Tinker
 wrote:

It is also in the seeds list for the other nodes.
Should I remove it from those, restart them one at a
time, then restart it?

/etc/cassandra # grep -i bootstrap *
doesn't show anything so I don't think I have
auto_bootstrap false.

Thanks very much for the help.


On Mon, Apr 3, 2023 at 7:01 AM Carlos Diaz
 wrote:

Just remove it from the seed list in the
cassandra.yaml file and restart the node.  Make
sure that auto_bootstrap is set to true first though.

On Sun, Apr 2, 2023 at 9:59 PM David Tinker
 wrote:

So likely because I made it a seed node when I
added it to the cluster it didn't do the
bootstrap process. How can I recover this?

On Mon, Apr 3, 2023 at 6:41 AM David Tinker
 wrote:

Yes replication factor is 3.

I ran nodetool repair -pr on all the nodes
(one at a time) and am still having issues
getting data back from queries.

I did make the new node a seed node.

Re "rack4": I assumed that was just an
indication as to the physical location of
the server for redundancy. This one is
separate from the others so I used rack4.

On Mon, Apr 3, 2023 at 6:30 AM Carlos Diaz
 wrote:

I'm assuming that your replication
factor is 3.  If that's the case, did
you intentionally put this node in
rack 4?  Typically, you want to add
nodes in multiples of your replication
factor in order to keep the "racks"
balanced.  In other words, this node
should have been added to rack 1, 2 or 3.

Having said that, you should be able
to easily fix your problem by running
a nodetool repair -pr on the new node.

On Sun, Apr 2, 2023 at 8:16 PM David
Tinker  wrote:

Hi All

I recently added a node to my 3
node Cassandra 4.0.5 cluster and
now many reads are not returning
rows! What do I need to do to fix
this? There weren't any errors in
the logs or other problems that I
could see. I expected the cluster
to balance itself but this hasn't
happened (yet?). The nodes are
similar so I have num_tokens=256

Re: Nodetool command to pre-load the chunk cache

2023-03-21 Thread Bowen Song via user
It sounds like a bad policy, and you should push for that to be changed. 
Failing that, you have some options:


1. Use faster disks. This improves cold start performance, without 
relying on the caches.


2. Rely on row cache instead. It can be saved periodically and loaded at 
startup time.


3. Ensure read CL < RF, and rely on speculative retries. Note: you will 
need to avoid restarting two servers owning the same token range 
consecutively for this to work.


These are off the top of my head, but I'm sure there are more ways to do it. 
You should decide based on your situation.
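
For example, option 2 above roughly corresponds to something like this 
(illustrative values and table name only; option names as in a 3.x/4.0 
cassandra.yaml):

    # cassandra.yaml
    row_cache_size_in_mb: 512
    row_cache_save_period: 14400    # save the cache every 4h so it can be reloaded at startup

    -- and per table, in cqlsh
    ALTER TABLE my_ks.my_table WITH caching = {'keys': 'ALL', 'rows_per_partition': '200'};

Option 3 is also a per-table setting, e.g.
ALTER TABLE my_ks.my_table WITH speculative_retry = '99PERCENTILE';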


BTW, manually loading the chunk cache is never going to work unless you 
know what the hot data is. Loading a whole table into the chunk cache makes 
no sense unless the table on each server can fit in 512 MB of memory, but 
then why do you even need Cassandra?



On 21/03/2023 17:15, Carlos Diaz wrote:

Hi Team,

We are heavy users of Cassandra at a pretty big bank. Security 
measures require us to constantly refresh our C* nodes every x number 
of days.  We normally do this in a rolling fashion, taking one node 
down at a time and then refreshing it with a new instance.  This 
process has been working for us great for the past few years.


However, we recently started having issues when a newly refreshed 
instance comes back online. Our automation waits a few minutes for the 
node to become "ready (UN)" and then moves on to the next node.  The 
problem that we are facing is that when the node is ready, the chunk 
cache is still empty, so when the node starts accepting new 
connections, queries that go to it take much longer to respond and this 
causes errors for our apps.


I was thinking that it would be great if we had a nodetool command 
that would allow us to prefetch a certain table or a set of tables to 
preload the chunk cache.  Then we could simply add another check 
(nodetool info?), to ensure that the chunk cache has been preloaded 
enough to handle queries to this particular node.


Would love to hear others' feedback on the feasibility of this idea.

Thanks!





Re: New DC / token distribution not balanced

2023-03-17 Thread Bowen Song via user

Sorry to see you go.

To unsubscribe from this mailing list, please send an email to 
user-unsubscr...@cassandra.apache.org


On 17/03/2023 05:42, Mathieu Delsaut wrote:

unsubscribe


On Thu, 16 Mar 2023 at 21:02, Bowen Song via user 
 wrote:


No, allocate_tokens_for_local_replication_factor does not exist in
Cassandra 3. It was introduced in Cassandra 4.0.

Now, may I interest you in an upgrade? Not only does Cassandra 4
come with a lot of improvements and bug fixes, it's also a fairly
painless process. I find it much easier to upgrade Cassandra than
other databases or similar distributed software.


For your question about allocate_tokens_* , you can read it here:
https://issues.apache.org/jira/browse/CASSANDRA-7032

In short, the old vnode token allocation algorithm can cause load
imbalance between nodes in large clusters, and the new algorithm
takes
the RF into consideration, resulting in a more load-balanced cluster.


On 16/03/2023 16:02, Max Campos wrote:
> Does this exist for Cassandra 3.x?  I know it was in DSE for
DSE’s 3.x
> equivalent, and seems to be in Cassandra 4.x cassandra.yaml.  I
don’t
> see it here, though:
>
>
https://github.com/apache/cassandra/blob/cassandra-3.11/conf/cassandra.yaml
>
> BTW:  Wow - what a difference allocate_tokens_* makes. Living in
the
> RF=3 with 3 nodes world for so many years, I had no idea.  :-)


Re: New DC / token distribution not balanced

2023-03-16 Thread Bowen Song via user
No, allocate_tokens_for_local_replication_factor does not exist in 
Cassandra 3. It was introduced in Cassandra 4.0.


Now, may I interest you in an upgrade? Not only does Cassandra 4 come with 
a lot of improvements and bug fixes, it's also a fairly painless 
process. I find it much easier to upgrade Cassandra than other databases 
or similar distributed software.



For your question about allocate_tokens_* , you can read it here: 
https://issues.apache.org/jira/browse/CASSANDRA-7032


In short, the old vnode token allocation algorithm can cause load 
imbalance between nodes in large clusters, and the new algorithm takes 
the RF into consideration, resulting in a more load-balanced cluster.



On 16/03/2023 16:02, Max Campos wrote:
Does this exist for Cassandra 3.x?  I know it was in DSE for DSE’s 3.x 
equivalent, and seems to be in Cassandra 4.x cassandra.yaml.  I don’t 
see it here, though:


https://github.com/apache/cassandra/blob/cassandra-3.11/conf/cassandra.yaml

BTW:  Wow - what a difference allocate_tokens_* makes.  Living in the 
RF=3 with 3 nodes world for so many years, I had no idea.  :-)


Re: New DC / token distribution not balanced

2023-03-16 Thread Bowen Song via user
You may find "allocate_tokens_for_local_replication_factor" more useful 
than "allocate_tokens_for_keyspace" when you are spinning up a new DC.


On 16/03/2023 06:25, Max Campos wrote:

Update:  I figured out the problem!

The “allocate_tokens_for_keyspace” value needs to be set for a 
keyspace that has RF=3 /for the DC being added/.  I just had the RF=3 
set for the existing DC.


I created a dummy keyspace with RF=3 for the new DC, set 
“allocate_tokens_for_keyspace=” and then added the nodes … 
voila!  Problem solved!



On Mar 15, 2023, at 10:50 pm, Max Campos  
wrote:


Hi All -

I’m having a lot of trouble adding a new DC and getting a balanced 
ring (i.e. every node has the same percentage of the token ring).


My config:

GossipingPropertyFileSnitch
allocate_tokens_for_keyspace: <name of an RF=3 keyspace in the existing DC>

num_tokens = 16

6 nodes in the new DC / 3 nodes in the existing DC
Cassandra 3.0.23

I add the nodes to the new DC one-by-one, waiting for “Startup 
complete” … then create a new test keyspace with RF=3:


create keyspace test_tokens with replication = {'class': 
'NetworkTopologyStrategy', 'ies3': '3'}


… but then when I run “nodetool status test_tokens”, i see that the 
“Owns (effective)” is way out of balance (see attached image — “ies3” 
is the new DC).

*.62 / node1 / rack1 - 71.8%
*.63 / node2 / rack2 - 91.4%
*.64 / node3 / rack3 - 91.6%
*.66 / node4 / rack1 - 28.2%
*.67 / node5 / rack2 - 8.6%
*.68 / node6 / rack3 - 8.4%

node1 & node2 are seed nodes, along with 2 nodes from the existing DC.

How can I get even token distribution — “Owns (effective) = 50%" (or 
1/6 of the token range for each node)?


Also: I’ve made several attempts to try to figure this out (ex: all 
nodes in 1 rack? each node has own rack?  2 nodes per rack?). 
 Between each attempt I’m running “nodetool decommission” one-by-one, 
 blowing away /var/lib/cassandra/*, etc.  Is it possible that the 
existing DC’s gossip is remembering the token range & thus causing 
problems when I recreate the new DC with some other configuration 
parameters?  Do I need to do something to clear out the gossip 
between attempts?


Thanks everyone.

- Max





Re: Adding an IPv6-only server to a dual-stack cluster

2022-11-18 Thread Bowen Song via user
Not that simple. By making a node listen on both IPv4 and IPv6, it will 
accept connections from both, but other nodes will still only try to 
connect to this node on the address it is broadcasting. That 
means if a node is broadcasting an IPv4 address, then all other nodes in 
the cluster must be able to reach it on that IPv4 address. That's why 
you'll need NAT64 to make sure the IPv6-only nodes can reach the 
dual-stack nodes on their IPv4 addresses.


On 18/11/2022 09:12, Lapo Luchini wrote:
So basically listen_address=:: (which should accept both IPv4 and 
IPv6) is fine, as long as broadcast_address reports the same single 
IPv4 address that the node always reported previously?


The presence of broadcast_address removes the "different nodes in the 
cluster pick different addresses for you" case?


On 2022-11-16 14:03, Bowen Song via user wrote:
I would expect that you'll need NAT64 in order to have a cluster that 
mixes IPv6-only servers with dual-stack servers that are 
broadcasting their IPv4 addresses. Once all IPv4-broadcasting 
dual-stack nodes are replaced with nodes that are either IPv6-only or 
dual-stack but broadcasting IPv6 instead, the NAT64 can be removed.



On 09/11/2022 17:27, Lapo Luchini wrote:
I have a (3.11) cluster running on IPv4 addresses on a set of 
dual-stack servers; I'd like to add a new IPv6-only server to the 
cluster… is it possible to have the dual-stack ones answer on IPv6 
addresses as well (while keeping the single IPv4 address as 
broadcast_address, I guess)?


This sentence in cassandra.yaml suggests it's impossible:

    Setting listen_address to 0.0.0.0 is always wrong.

FAQ #1 also confirms that (is this true also with broadcast_address?):

    if different nodes in the cluster pick different addresses for you,
    Bad Things happen.

Is it possible to do this, or is my only chance to shutdown the 
entire cluster and launch it again as IPv6-only?

(IPv6 is available on each and every host)

And even in that case, is it possible for a cluster to go down from 
a set of IPv4 address and be recovered on a parallel set of IPv6 
addresses? (I guess gossip does not expect that)


thanks in advance for any suggestion,








Re: Adding an IPv6-only server to a dual-stack cluster

2022-11-16 Thread Bowen Song via user
I would expect that you'll need NAT64 in order to have a cluster that 
mixes IPv6-only servers with dual-stack servers that are broadcasting 
their IPv4 addresses. Once all IPv4-broadcasting dual-stack 
nodes are replaced with nodes that are either IPv6-only or dual-stack but 
broadcasting IPv6 instead, the NAT64 can be removed.



On 09/11/2022 17:27, Lapo Luchini wrote:
I have a (3.11) cluster running on IPv4 addresses on a set of 
dual-stack servers; I'd like to add a new IPv6-only server to the 
cluster… is it possible to have the dual-stack ones answer on IPv6 
addresses as well (while keeping the single IPv4 address as 
broadcast_address, I guess)?


This sentence in cassandra.yaml suggests it's impossible:

    Setting listen_address to 0.0.0.0 is always wrong.

FAQ #1 also confirms that (is this true also with broadcast_address?):

    if different nodes in the cluster pick different addresses for you,
    Bad Things happen.

Is it possible to do this, or is my only chance to shutdown the entire 
cluster and launch it again as IPv6-only?

(IPv6 is available on each and every host)

And even in that case, is it possible for a cluster to go down from a 
set of IPv4 address and be recovered on a parallel set of IPv6 
addresses? (I guess gossip does not expect that)


thanks in advance for any suggestion,



Re: Query drivertimeout PT2S

2022-11-08 Thread Bowen Song via user
This is the mailing list for Apache Cassandra, which is not the same 
as the DataStax Enterprise Cassandra you are using. We may still be able to 
help here if you could provide more details, such as the queries, table 
schema, system stats (CPU, RAM, disk IO, network, and so on), logs, 
table stats, etc., but if it's a DSE Cassandra specific issue, you may 
have better luck contacting DataStax directly or posting it on the 
DataStax Community.


On 08/11/2022 14:58, Shagun Bakliwal wrote:

Hi All,

My application is frequently getting timeout errors since 2 weeks now. 
I'm using datastax Cassandra 4.14


Can someone help me here?

Thanks,
Shagun

Re: Upgrade

2022-11-08 Thread Bowen Song via user
You should take a snapshot before starting the upgrade process. You 
cannot achieve a snapshot of "the most current situation" in a live 
cluster anyway, as data is constantly written to the cluster even after 
a node is stopped for upgrading. So you've got to accept slightly outdated 
snapshots if you ever want to downgrade.
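
A rough per-node sketch (the service name and the package step are 
placeholders for however Cassandra was installed):

    nodetool snapshot -t pre-4.0.6    # snapshot first; snapshot flushes memtables itself
    nodetool drain                    # stop accepting writes and flush what's left
    sudo systemctl stop cassandra
    # ... upgrade the Cassandra package/binaries to 4.0.6 here ...
    sudo systemctl start cassandra
    nodetool status                   # confirm the node is back to UN before moving on

Between 4.0.x patch releases the SSTable format doesn't change, so 
nodetool upgradesstables shouldn't be needed.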


On the documentation side, I totally agree with you. The Apache 
Cassandra documentation on some common tasks, such as upgrading the 
cluster, is very lacking. I really hope this can be improved.



On 04/11/2022 08:13, Marc Hoppins wrote:


Hi all,

On a test setup I am looking to do an upgrade from 4.0.3 to 4.0.6.

Would one typically snapshot before DRAIN or after?

If DRAIN after snapshot, I would have to restart the service to 
snapshot and would this not then be accepting new operations/data?


If DRAIN before snapshot, would there be the possibility of not having 
the most current situation?  I realise that the latter option would 
render any changes fairly negligible even in a live environment.


Apologies if these questions seem redundant.  Apache documentation is 
not as comprehensive as it seems to be for (EG) HBASE.


Thanks

M


Re: Upgrade Pt2

2022-10-19 Thread Bowen Song via user
Please read 
https://docs.datastax.com/en/upgrading/docs/datastax_enterprise/upgrdCstarToDSE.html#_general_restrictions


The document is written for DSE Cassandra, but most of it applies to 
Apache Cassandra too.


In short, watch out for these:

Client side:

 * Check client driver compatibility.
 * Set the protocol version explicitly in your application.
 * Ensure that the list of initial contact points contains only hosts
   with the oldest Cassandra version or protocol version.

Server side:

 * Do not enable new features.
 * Do not run nodetool repair.
 * During the upgrade, do not bootstrap new nodes or decommission
   existing nodes.
 * Do not enable Change Data Capture (CDC) on a mixed-version cluster.
   Upgrade all nodes to DSE 5.1 (equivalent to Apache Cassandra 3.10)
   or later before enabling CDC.
 * Complete the cluster-wide upgrade before the expiration of
   gc_grace_seconds (approximately 13 days) to ensure any repairs
   complete successfully.
 * Do not issue TRUNCATE or DDL related queries during the upgrade process.

In addition, if using relevant security features:

 * Do not change security credentials or permissions until the upgrade
   is complete on all nodes.
 * If you are not already using Kerberos, do not set up Kerberos
   authentication before upgrading. First upgrade the cluster, and then
   set up Kerberos.

On 19/10/2022 11:36, Marc Hoppins wrote:


Hi all,

What (if any) problems could we expect from an upgrade?

Ie.,  If we have 12 nodes and I upgrade them one-at-a-time, some will 
be on the new version and others on the old.


Assuming that daily operations continue during this process, could 
problems occur with streaming replica from one node to another?


Marc


Re: Questions on the count and multiple index behaviour in cassandra

2022-09-28 Thread Bowen Song via user

It sounds like you are misusing/abusing Cassandra.

I've noticed the following Cassandra anti-patterns in your post:

1. Large or uneven partitions
   Putting all rows of a table in a single partition is definitely an
   anti-pattern unless you only have a very small number of rows.
2. "SELECT COUNT(*) FROM ..." without providing a partition key
   In your case, since all rows are in a single partition, it's
   equivalent to without a partition key.
3. Wide table (too many columns)
   91 columns sounds excessive, and may lead to reduced performance and
   heightened JVM GC pressure

Cassandra is not a SQL database. You should design your table schema 
around the queries, not design your queries around the table schema. You 
may also need to store multiple copies of the same data with different 
keys to satisfy different queries.
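
One possible re-model (purely illustrative table and column names): split 
the single huge partition into buckets, and maintain the count yourself on 
write, which avoids both the huge partition and the full scan:

    CREATE TABLE my_ks.records (
        bucket  int,
        id      uuid,
        payload text,
        PRIMARY KEY (bucket, id)
    );

    CREATE TABLE my_ks.record_counts (
        bucket int PRIMARY KEY,
        cnt    counter
    );

    -- on every insert into records, also do:
    UPDATE my_ks.record_counts SET cnt = cnt + 1 WHERE bucket = 42;

    -- total row count = sum of a small, known set of bucket counters
    SELECT bucket, cnt FROM my_ks.record_counts;

The counter table stays tiny, so reading all of it back is cheap.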


On 28/09/2022 12:44, Karthik K wrote:

Hi,

We have two doubts on cassandra 3.11 features:

1) Need to get counts of row from a cassandra table.
We have 3 node clusters with Apache Cassandra 3.11 version.

We loaded a table in Cassandra with 9 lakh (900,000) records. We have around 91 
columns in this table. Most of the records have text as datatype.

All these 9 lakh records were part of a single partition key.

When we tried a select count(*) query with that partition key, the 
query was timing out.


However, we were able to retrieve counts through multiple calls by 
fetching only
1 lakh records in each call. The only disadvantage here is the time 
taken which

is around 1minute and 3 seconds.

Is there any other approach to get the row count faster in cassandra? 
Do we need to '
change the data modelling approach to achieve this? Suggestions are 
welcome



2) How to data model in cassandra to support usage of multiple filters.
 We may also need the count of rows for this multiple filter query.

Thanks & Regards,
Karthikeyan

Re: node decommission

2022-09-26 Thread Bowen Song via user
No, decommission does not decrease the load, as it only streams the data 
to other nodes but doesn't remove it locally. However, decommission 
shouldn't increase the load either. I can't offer an explanation 
for the load increase in your case.
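
If you want to see what the node is actually doing while it's leaving, 
something like this on the decommissioning node gives a rough picture:

    nodetool netstats          # shows the outgoing streams and how much is left
    nodetool compactionstats   # any compactions still running alongside the streaming

Once the decommission has finished, the node is no longer part of the ring 
and its data directories can simply be deleted.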


On 26/09/2022 15:03, Marc Hoppins wrote:

Hulloa all,

I started a decommission. Node load was 1.08TiB.  After 6 or so hours the load 
is at 1.12TiB. Shouldn't it be DECREASING?


Re: Cassandra data sync time

2022-09-26 Thread Bowen Song via user
It looks like you have replication factor of 3 and total data size of 
1.43 GB per node. That's very small amount of data. Assuming the 
bottleneck is the network, not CPU or disk, and your 50 Mbps bandwidth 
is between each pair of servers across the two DCs (i.e. not the total 
bandwidth available between the DCs), the streaming process itself 
should only take minutes.
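
As a rough back-of-envelope, assuming each of the 3 new nodes ends up with 
roughly the same ~1.43 GB and the full 50 Mbps is usable: 50 Mbps ≈ 6 MB/s, 
and 1.43 GB ÷ 6 MB/s ≈ 4 minutes per node, so on the order of 10-15 minutes 
if the new nodes are rebuilt one after another (the streaming is normally 
kicked off with nodetool rebuild on each new node, pointing it at your 
existing DC, e.g. "nodetool rebuild dc-1").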


On 26/09/2022 12:14, Kaushal Shriyan wrote:




On Fri, Sep 23, 2022 at 8:39 PM Bowen Song via user 
 wrote:


What's your definition of "sync"? Streaming all the existing data
to the new DC? or the time lag between a write request is
completed in one DC and the other DC?

The former can be estimated based on a few facts about your setup
(number of nodes, data size, etc.) and some measured data
(streaming speed).

The latter is usually just slightly above the network latency, but
can spike up if and when the network between DCs suffer from
temporary connectivity issues.


Hi Bowen,

Thanks for the quick response. I was referring to streaming all the 
existing data to the new DC(DC2). We have



On 23/09/2022 15:58, Kaushal Shriyan wrote:

Hi,

Is there a way to measure cassandra nodes data sync time between
DC1 and DC2? Currently DC1 is the prod datacenter. I am adding
DC2 to the new data center by referring to
https://docs.apigee.com/private-cloud/v4.51.00/adding-data-center?hl=en.

https://docs.apigee.com/release/supported-software
Cassandra version :- 2.1.22

Is there a way to measure the time taken to sync the data in
current prod DC1 (Cassandra Node 1, 2 ,3) and the new DC2
(Cassandra Node 4, 5 ,6)?

Thanks in advance.

Best Regards,

Kaushal


Hi Bowen,

Thanks for the quick response. Streaming all the existing data from 
the current prod DC1 (Cassandra Node 1, 2 ,3) to the new DC2 
(Cassandra Node 4, 5 ,6). Data bandwidth between DC1 and DC2 is around 
50 Mbps. Please let me know if you need any additional details. Thanks 
in advance.


/opt/apigee/apigee-cassandra/bin/nodetool status
Datacenter: dc-1

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address        Load       Tokens  Owns (effective)  Host ID       
                        Rack
UN  192.198.11.4    1.43 GB    1       100.0% 
 dbfbd44f-kec5-4f91-bc7d-c31582aec35a  ra-1
UN  192.198.11.128  1.43 GB    1       100.0% 
 bc55019c-8ccb-4403-9dc4-481b90a262f6  ra-1
UN  192.198.11.3    1.43 GB    1       100.0% 
 4402901c-4562-4f0f-b14a-4eed40a9836c  ra-1


_On Node1_
du -ch /opt/apigee/data/apigee-cassandra/data
1.7G total

_On Node2
_
du -ch /opt/apigee/data/apigee-cassandra/data

_On Node3
_
du -ch /opt/apigee/data/apigee-cassandra/data

Best Regards,

Kaushal

Re: Cassandra data sync time

2022-09-23 Thread Bowen Song via user
What's your definition of "sync"? Streaming all the existing data to the 
new DC? or the time lag between a write request is completed in one DC 
and the other DC?


The former can be estimated based on a few facts about your setup 
(number of nodes, data size, etc.) and some measured data (streaming speed).


The latter is usually just slightly above the network latency, but can 
spike up if and when the network between DCs suffer from temporary 
connectivity issues.


On 23/09/2022 15:58, Kaushal Shriyan wrote:

Hi,

Is there a way to measure cassandra nodes data sync time between DC1 
and DC2? Currently DC1 is the prod datacenter. I am adding DC2 to the 
new data center by referring to 
https://docs.apigee.com/private-cloud/v4.51.00/adding-data-center?hl=en.


https://docs.apigee.com/release/supported-software
Cassandra version :- 2.1.22

Is there a way to measure the time taken to sync the data in current 
prod DC1 (Cassandra Node 1, 2 ,3) and the new DC2 (Cassandra Node 4, 5 
,6)?


Thanks in advance.

Best Regards,

Kaushal

Re: Restart Cassandra

2022-09-23 Thread Bowen Song via user
Even when a node has been stopped, it will still show up in the 
"nodetool status" output from other running nodes. While a node is 
starting, the status output from the node itself is of little use, because 
it may not yet have received the status of other nodes. You should ignore 
it until the node has fully started.


The time between restarting each node depends on how quickly a node 
starts. Replaying commit logs can take a very long time (I have seen it 
take over 10 minutes). You should always check a restarting node's 
current status, ensure it has finished starting, and then wait for 
gossip to settle (sleeping for a few minutes should do) before moving on to 
the next node.
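
A minimal sketch of the per-node check, assuming a shell script drives the 
rolling restart:

    # wait until the restarted node reports NORMAL, then let gossip settle
    until nodetool netstats 2>/dev/null | grep -q 'Mode: NORMAL'; do
        sleep 10
    done
    nodetool info | grep -E 'Gossip active|Native Transport active'
    sleep 120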


On 23/09/2022 15:07, Marc Hoppins wrote:

I restarted 48 nodes and every one came up fine. I was just wondering why the 
status run on the restarted node has no ID until it has finished dealing with 
whatever it does when starting up but it shows up immediately when status is 
run on any other node.

I guess it prompts the question: how much time should elapse between restarting 
each node? It seems to be something <60 seconds but I suppose it would depend 
on whatever was lingering in the commit directory.

-Original Message-
From: Bowen Song via user 
Sent: Friday, September 23, 2022 3:47 PM
To: user@cassandra.apache.org
Subject: Re: Restart Cassandra

EXTERNAL


Did the node finish starting when you checked the "nodetool status"
output? Try "nodetool netstats" on the starting node, the output will show "Mode: NORMAL" if it has finished starting. 
It's also worth checking the "nodetool info" output, and make sure "Gossip active" and "Native Transport 
active" (unless you have disabled it) are "true".

On 23/09/2022 08:17, Marc Hoppins wrote:

Hi all,

Restarting the service on a node.  Checking status from a remote node, I see:

(prod) marc.hoppins.ipa@ba-cassandra01:~ $ /opt/cassandra/bin/nodetool status 
-r|grep 03
UN  ba-cassandra09   779.03 GiB   16  ? 
1fc8061d-2dd4-4b2c-97fa-e492063da495  SSW09
UN  ba-cassandra20   796.94 GiB   16  ? 
c6b43e76-bd5d-4672-a62a-83a06030578d  SSW09
UN  ba-cassandra10   750.84 GiB   16  ? 
c03ae9c6-89cb-4e65-a1ef-a56e2efc24da  SSW09
UN  ba-cassandra04   785.97 GiB   16  ? 
16dac20f-89fe-435c-8b49-d80a03fe239e  SSW09
DN  ba-cassandra03   729.43 GiB   16  ? 
8785b173-6b68-45a4-ad38-e9b4036ffaf5  SSW09
UN  dr1-cassandra18  738.9 GiB16  ? 
84348044-b6c6-44d0-9038-b1d49d39e496  SSW02
UN  dr1-cassandra03  783.04 GiB   16  ? 
21dac8e4-b556-48f1-873d-fb2876e2c349  SSW02

But when checking locally, I see:

?N  ba-cassandra08   ?   16  ? 
8520bdcd-1cfb-431f-a99c-15b8ca288e96  SSW09
?N  ba-cassandra04   ?   16  ? 
16dac20f-89fe-435c-8b49-d80a03fe239e  SSW09
?N  ba-cassandra11   ?   16  ? 
cf010de0-657c-4135-beec-7ba37cc3d8f4  SSW09
?N  ba-cassandra18   ?   16  ? 
994c67d7-e6f9-4419-a02b-b5296ec92cb0  SSW09
UN  ba-cassandra03   146.03 GiB  16  ?  
 SSW09
?N  ba-cassandra21   ?   16  ? 
7a8fc8c4-fb64-4bb7-ad5c-cb5112d9f783  SSW09
?N  ba-cassandra13   ?   16  ? 
853f095a-c780-473d-85f0-b8d047d745f1  SSW09

When I recheck it, I notice that the data count increases to the correct amount 
after some small time, the node ID appears for the local status, and the remote 
status shows as UN.  If the remote status shows the node ID, why is it missing 
locally?   Is the node ID only stored on the seeds?


Re: Restart Cassandra

2022-09-23 Thread Bowen Song via user
Did the node finish starting when you checked the "nodetool status" 
output? Try "nodetool netstats" on the starting node, the output will 
show "Mode: NORMAL" if it has finished starting. It's also worth 
checking the "nodetool info" output, and make sure "Gossip active" and 
"Native Transport active" (unless you have disabled it) are "true".


On 23/09/2022 08:17, Marc Hoppins wrote:

Hi all,

Restarting the service on a node.  Checking status from a remote node, I see:

(prod) marc.hoppins.ipa@ba-cassandra01:~ $ /opt/cassandra/bin/nodetool status 
-r|grep 03
UN  ba-cassandra09   779.03 GiB   16  ? 
1fc8061d-2dd4-4b2c-97fa-e492063da495  SSW09
UN  ba-cassandra20   796.94 GiB   16  ? 
c6b43e76-bd5d-4672-a62a-83a06030578d  SSW09
UN  ba-cassandra10   750.84 GiB   16  ? 
c03ae9c6-89cb-4e65-a1ef-a56e2efc24da  SSW09
UN  ba-cassandra04   785.97 GiB   16  ? 
16dac20f-89fe-435c-8b49-d80a03fe239e  SSW09
DN  ba-cassandra03   729.43 GiB   16  ? 
8785b173-6b68-45a4-ad38-e9b4036ffaf5  SSW09
UN  dr1-cassandra18  738.9 GiB16  ? 
84348044-b6c6-44d0-9038-b1d49d39e496  SSW02
UN  dr1-cassandra03  783.04 GiB   16  ? 
21dac8e4-b556-48f1-873d-fb2876e2c349  SSW02

But when checking locally, I see:

?N  ba-cassandra08   ?   16  ? 
8520bdcd-1cfb-431f-a99c-15b8ca288e96  SSW09
?N  ba-cassandra04   ?   16  ? 
16dac20f-89fe-435c-8b49-d80a03fe239e  SSW09
?N  ba-cassandra11   ?   16  ? 
cf010de0-657c-4135-beec-7ba37cc3d8f4  SSW09
?N  ba-cassandra18   ?   16  ? 
994c67d7-e6f9-4419-a02b-b5296ec92cb0  SSW09
UN  ba-cassandra03   146.03 GiB  16  ?  
 SSW09
?N  ba-cassandra21   ?   16  ? 
7a8fc8c4-fb64-4bb7-ad5c-cb5112d9f783  SSW09
?N  ba-cassandra13   ?   16  ? 
853f095a-c780-473d-85f0-b8d047d745f1  SSW09

When I recheck it, I notice that the data count increases to the correct amount 
after some small time, the node ID appears for the local status, and the remote 
status shows as UN.  If the remote status shows the node ID, why is it missing 
locally?   Is the node ID only stored on the seeds?


Re: Understanding multi region read query and latency

2022-08-09 Thread Bowen Song via user
Adding sleeps to solve race conditions is a bad practice and should be 
avoided if possible. Instead, use read and write consistency levels that 
guarantee strong consistency where it is required.
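
For example (taking RF=3 per DC purely as an illustration), a LOCAL_QUORUM 
write is acknowledged by 2 replicas and a LOCAL_QUORUM read waits for 2 
replicas; since 2 + 2 > 3, at least one replica in the read set always 
holds the acknowledged write, so no sleep is needed in the application code.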


On 09/08/2022 23:49, Jim Shaw wrote:

Raphael:
   Have you found  root cause ? If not, here are a few tips, based on 
what I experienced before, but may not  be same as your case, just 
hope it is helpful.

1) app side called wrong code module

get the cql from system.prepared_statements

cql statement is helpful to developers to search their code and find 
issue parts. In my case,  was function disabled but actually not, when 
they see cql statement, they realized.


2) app side code query immediately after write

from the trace, you have read time,  get this row write time by

select writetime ("any non-key column here") from "table_name_here" 
where ...;


if read time is too close to write time,  ask developers to add a 
sleep in code.


while earlier phase of projects using cassandra, developers still get 
used to rdbms style, forget cassandra is distributed database (i.e. in 
code, 10 cql statements in a logic order, they assume they will be 
executed in order, but actually in distributed system, no order, last 
line in code may execute 1st in cassandra cluster).


3) duplicate the case
use copy tables, testing data, by comparing the traces, duplicate the 
case, so know your debug direction right or not right.



Regards,


Jim

On Sun, Aug 7, 2022 at 5:14 PM Stéphane Alleaume 
 wrote:


You're right too, this option is not new, sorry.

Can this option be useful?


On Sun, 7 Aug 2022 at 22:18, Bowen Song via user
 wrote:

Do you mean "nodetool settraceprobability"? This is not
exactly new, I remember it was available on Cassandra 2.x.

On 07/08/2022 20:43, Stéphane Alleaume wrote:

I think perhaps you already know, but I read you can now trace
only a % of all queries; I will look to retrieve the name of
this functionality (in a new Cassandra release).

Hope it will help
Kind regards
Stéphane


On Sun, 7 Aug 2022 at 20:26, Raphael Mazelier
 wrote:

> "Read repair is in the blocking read path for the
query, yep"

OK interesting. This is not what I understood from the
documentation. And I use localOne level consistency.

I enabled tracing (see in the attachment of my first
msg)/ but I didn't see read repair in the trace (and btw
I tried to completely disable it on my table setting both
read_repair_chance and local_dc_read_repair_chance to 0).

The problem when enabling trace in cqlsh is that I only
get slow result. For having fast answer I need to iterate
faster on my queries.

I can provide again trace for analysis. I got something
more readable in python.

Best,

--

Raphael


On 07/08/2022 19:30, C. Scott Andreas wrote:

> but still as I understand the documentation the read
repair should not be in the blocking path of a query ?

Read repair is in the blocking read path for the query,
yep. At quorum consistency levels, the read repair must
complete before returning a result to the client to
ensure the data returned would be visible on subsequent
reads that address the remainder of the quorum.

If you enable tracing - either for a single CQL
statement that is expected to be slow, or probabilistic
from the server side to catch a slow query in the act -
that will help identify what’s happening.

- Scott


On Aug 7, 2022, at 10:25 AM, Raphael Mazelier
 <mailto:r...@futomaki.net> wrote:



Nope. And what really puzzle me is in the trace we
really show the difference between queries. The fast
queries only request read from one replicas, while slow
queries request from multiple replicas (and not only
local to the dc).

On 07/08/2022 14:02, Stéphane Alleaume wrote:

Hi

Is there some GC which could affect the coordinator node?

Kind regards
Stéphane

On Sun, 7 Aug 2022 at 13:41, Raphael Mazelier
 wrote:

Thanks for the answer but I was well aware of
this. I use localOne as consistency level.

My client connect to a local seeds, then choose a
local coordinator (as far I can understand the
trace log).

Then for a batch of request I got approximately
98% of request treated in 2/3ms in local DC with
one read request, and

Re: Understanding multi region read query and latency

2022-08-07 Thread Bowen Song via user
Do you mean "nodetool settraceprobability"? This is not exactly new, I 
remember it was available on Cassandra 2.x.


On 07/08/2022 20:43, Stéphane Alleaume wrote:
I think perhaps you already know, but I read you can now trace only a % 
of all queries; I will look to retrieve the name of this 
functionality (in a new Cassandra release).


Hope it will help
Kind regards
Stéphane


On Sun, 7 Aug 2022 at 20:26, Raphael Mazelier  wrote:

> "Read repair is in the blocking read path for the query, yep"

OK interesting. This is not what I understood from the
documentation. And I use localOne level consistency.

I enabled tracing (see in the attachment of my first msg)/ but I
didn't see read repair in the trace (and btw I tried to completely
disable it on my table setting both read_repair_chance and
local_dc_read_repair_chance to 0).

The problem when enabling trace in cqlsh is that I only get slow
result. For having fast answer I need to iterate faster on my
queries.

I can provide again trace for analysis. I got something more
readable in python.

Best,

--

Raphael


On 07/08/2022 19:30, C. Scott Andreas wrote:

> but still as I understand the documentation the read repair
should not be in the blocking path of a query ?

Read repair is in the blocking read path for the query, yep. At
quorum consistency levels, the read repair must complete before
returning a result to the client to ensure the data returned
would be visible on subsequent reads that address the remainder
of the quorum.

If you enable tracing - either for a single CQL statement that is
expected to be slow, or probabilistic from the server side to
catch a slow query in the act - that will help identify what’s
happening.

- Scott


On Aug 7, 2022, at 10:25 AM, Raphael Mazelier
 <mailto:r...@futomaki.net> wrote:



Nope. And what really puzzle me is in the trace we really show
the difference between queries. The fast queries only request
read from one replicas, while slow queries request from multiple
replicas (and not only local to the dc).

On 07/08/2022 14:02, Stéphane Alleaume wrote:

Hi

Is there some GC which could affect coordinarir node ?

Kind regards
Stéphane

On Sun, 7 Aug 2022 at 13:41, Raphael Mazelier
 wrote:

Thanks for the answer but I was well aware of this. I use
localOne as consistency level.

My client connect to a local seeds, then choose a local
coordinator (as far I can understand the trace log).

Then for a batch of request I got approximately 98% of
request treated in 2/3ms in local DC with one read request,
and 2% treated by many nodes (according to the trace) and
then way longer (250ms).

?

    On 06/08/2022 14:30, Bowen Song via user wrote:


See the diagram below. Your problem almost certainly
arises from step 4, in which an incorrect consistency
level set by the client caused the coordinator node to
send the READ command to nodes in other DCs.

The load balancing policy only affects step 2 and 3, not
step 1 or 4.

You should change the consistency level to
LOCAL_ONE/LOCAL_QUORUM/etc. to fix the problem.

On 05/08/2022 22:54, Bowen Song wrote:

The DCAwareRoundRobinPolicy/TokenAwareHostPolicy
controls which Cassandra coordinator node the client
sends queries to, not the nodes it connects to, nor the
nodes that perform the actual read.

A client sends a CQL read query to a coordinator node,
and the coordinator node parses the CQL query and sends
READ requests to other nodes in the cluster based on the
consistency level.

Have you checked the consistency level of the session
(and the query if applicable)? Is it prefixed with
"LOCAL_"? If not, the coordinator will send the READ
requests to non-local DCs.


On 05/08/2022 19:40, Raphael Mazelier wrote:


Hi Cassandra Users,

I'm relatively new to Cassandra and first I have to say
I'm really impressed by the technology.

Good design and a lot of stuff to understand the
underlying (the Oreilly book help a lot as well as
thelastpickle blog post).

I have an muli-datacenter c* cluster (US, Europe,
Singapore) with eight node on each (two seeds on each
region), two racks on Eu, Singapore, 3 on US. Everything
deployed in AWS.

We have a keyspace configured with network topology and
two replicas on every region like this: {'class':
'NetworkTopologyStrategy', 'ap-southeast-1': '2',
'eu-west-1': '2', 'us-east-1': '2'}


Investigating some performance issue I noticed strange
things in my experiment:

 

Exception encountered during startup: TruncateException

2022-08-06 Thread Bowen Song via user

Hello,


I have Cassandra 4.0.1 on a server failing to start. The server was 
power cycled after it experienced an unrecoverable memory error detected 
by EDAC. The memory error was transitory, and AFAIK it has disappeared. 
But Cassandra is not starting. The logs are:


   INFO  [main] 2022-08-06 18:11:53,494 ColumnFamilyStore.java:2242 -
   Truncating system.size_estimates
   DEBUG [MemtablePostFlush:1] 2022-08-06 18:11:53,495
   ColumnFamilyStore.java:933 - forceFlush requested but everything is
   clean in size_estimates
   INFO  [main] 2022-08-06 18:11:53,496 ColumnFamilyStore.java:2279 -
   Truncating system.size_estimates with truncatedAt=1659805913495
   DEBUG [main] 2022-08-06 18:11:53,496
   CompactionStrategyManager.java:519 - Recreating compaction strategy
   - disk boundaries are out of date for system.size_estimates.
   DEBUG [main] 2022-08-06 18:11:53,497 DiskBoundaryManager.java:55 -
   Refreshing disk boundary cache for system.size_estimates
   DEBUG [main] 2022-08-06 18:11:53,497 DiskBoundaryManager.java:94 -
   Got local ranges
   [Full(/**.**.**.**:7000,(-9223372036854775808,-9223372036854775808])]
   (ringVersion = 428)
   DEBUG [main] 2022-08-06 18:11:53,498 DiskBoundaryManager.java:58 -
   Updating boundaries from
   DiskBoundaries{directories=[DataDirectory{location=/var/lib/cassandra/data}],
   positions=null, ringVersion=1, directoriesVersion=0} to DiskBoundarie
   s{directories=[DataDirectory{location=/var/lib/cassandra/data}],
   positions=[max(9223372036854775807)], ringVersion=428,
   directoriesVersion=0} for system.size_estimates
   DEBUG [MemtablePostFlush:1] 2022-08-06 18:11:53,498
   ColumnFamilyStore.java:933 - forceFlush requested but everything is
   clean in size_estimates
   ERROR [main] 2022-08-06 18:11:53,502 CassandraDaemon.java:909 -
   Exception encountered during startup
   org.apache.cassandra.exceptions.TruncateException: Error during
   truncate: java.lang.IllegalArgumentException: Requested permits (0)
   must be positive
    at
   
org.apache.cassandra.cql3.statements.TruncateStatement.executeLocally(TruncateStatement.java:96)
    at
   
org.apache.cassandra.cql3.QueryProcessor.executeInternal(QueryProcessor.java:323)
    at
   
org.apache.cassandra.db.SystemKeyspace.clearAllEstimates(SystemKeyspace.java:1360)
    at
   
org.apache.cassandra.service.StorageService.cleanupSizeEstimates(StorageService.java:4002)
    at
   org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:373)
    at
   
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:763)
    at
   org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:887)
   Caused by: java.lang.RuntimeException:
   java.lang.IllegalArgumentException: Requested permits (0) must be
   positive
    at
   
org.apache.cassandra.db.ColumnFamilyStore.runWithCompactionsDisabled(ColumnFamilyStore.java:2378)
    at
   
org.apache.cassandra.db.ColumnFamilyStore.runWithCompactionsDisabled(ColumnFamilyStore.java:2325)
    at
   
org.apache.cassandra.db.ColumnFamilyStore.truncateBlocking(ColumnFamilyStore.java:2302)
    at
   
org.apache.cassandra.cql3.statements.TruncateStatement.executeLocally(TruncateStatement.java:92)
    ... 6 common frames omitted
   Caused by: java.lang.IllegalArgumentException: Requested permits (0)
   must be positive
    at
   com.google.common.base.Preconditions.checkArgument(Preconditions.java:189)
    at
   
com.google.common.util.concurrent.RateLimiter.checkPermits(RateLimiter.java:430)
    at
   com.google.common.util.concurrent.RateLimiter.reserve(RateLimiter.java:285)
    at
   com.google.common.util.concurrent.RateLimiter.acquire(RateLimiter.java:273)
    at
   
org.apache.cassandra.db.ColumnFamilyStore.snapshotWithoutFlush(ColumnFamilyStore.java:1849)
    at
   
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2029)
    at
   
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:2005)
    at
   
org.apache.cassandra.db.ColumnFamilyStore.snapshot(ColumnFamilyStore.java:1993)
    at
   org.apache.cassandra.db.ColumnFamilyStore$2.run(ColumnFamilyStore.java:2288)
    at
   
java.base/java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
    at
   
org.apache.cassandra.db.ColumnFamilyStore.runWithCompactionsDisabled(ColumnFamilyStore.java:2374)
    ... 9 common frames omitted

I've tried to restart Cassandra service multiple times on the server 
without any success.


I searched the Internet and found this on StackOverflow: 
https://stackoverflow.com/questions/66795461/cassandra-4-0-beta4-exception-encountered-during-startup-requested-permits-0-m


 * The 1st answer, by Erick Ramirez, suggested removing the
   "snapshot" directories from size_estimates and table_estimates
   

Re: Understanding multi region read query and latency

2022-08-05 Thread Bowen Song via user
The DCAwareRoundRobinPolicy/TokenAwareHostPolicy controls which 
Cassandra coordinator node the client sends queries to, not the nodes it 
connects to, nor the nodes that perform the actual read.


A client sends a CQL read query to a coordinator node; the 
coordinator node parses the CQL query and sends READ requests to other 
nodes in the cluster based on the consistency level.


Have you checked the consistency level of the session (and the query if 
applicable)? Is it prefixed with "LOCAL_"? If not, the coordinator will 
send the READ requests to non-local DCs.



On 05/08/2022 19:40, Raphael Mazelier wrote:


Hi Cassandra Users,

I'm relatively new to Cassandra and first I have to say I'm really 
impressed by the technology.


Good design and a lot of stuff to understand the underlying (the 
Oreilly book help a lot as well as thelastpickle blog post).


I have an muli-datacenter c* cluster (US, Europe, Singapore) with 
eight node on each (two seeds on each region), two racks on Eu, 
Singapore, 3 on US. Everything deployed in AWS.


We have a keyspace configured with network topology and two replicas 
on every region like this: {'class': 'NetworkTopologyStrategy', 
'ap-southeast-1': '2', 'eu-west-1': '2', 'us-east-1': '2'}



Investigating some performance issue I noticed strange things in my 
experiment:


What we expect is very low latency, 3/5ms max, for this specific select 
query. So we want every read to be local to each datacenter.


We configure DCAwareRoundRobinPolicy(local_dc=DC) in python, and the 
same in Go gocql.TokenAwareHostPolicy(gocql.DCAwareRoundRobinPolicy("DC"))


Testing a bit with two short program (I can provide them) in go and 
python I notice very strange result. Basically I do the same query 
over and over with a very limited dataset of id.


The first result were surprising cause the very first query were 
always more than 250ms and after with stressing c* (playing with sleep 
between query) I can achieve a good ratio of query at 3/4 ms (what I 
expected).


My guess was that long query were somewhat executed not locally (or at 
least imply multi datacenter queries) and short one no.


Activating tracing in my program (like enalbing trace in cqlsh) kindla 
confirm my suspicion.


(I will provide trace in attachment).

My question is why sometime C* try to read not localy? how we can 
disable it? what is the criteria for this?


(btw I'm very not fan of this multi region design for theses very 
specific kind of issues...)


Also side question: why C* is so slow at connection? it's like it's 
trying to reach every nodes in each DC? (we only provide locals seeds 
however). Sometimes it take more than 20s...


Any help appreciated.

Best,

--

Raphael Mazelier



Re: unsubscribe

2022-08-04 Thread Bowen Song via user
Please send an email to "user-unsubscr...@cassandra.apache.org" to 
unsubscribe from this mailing list.


On 04/08/2022 18:29, Dathan Vance Pattishall wrote:

unsubscribe

Re: Service shutdown

2022-08-04 Thread Bowen Song via user
Generally speaking, I've seen the Cassandra process stop for the 
following reasons:


   OOM killer
   JVM OOM
   Received a signal, such as SIGTERM and SIGKILL
   File IO error when disk_failure_policy or commit_failure_policy is
   set to die
   Hardware issues, such as memory corruption, causing Cassandra to crash
   Reaching ulimit resource limits, such as "too many open files"

They all leave traces behind. You said you've checked OS logs, and you 
posted only the systemd logs from DAEMON.LOG. Have you checked "dmesg" 
output? Some system logs, such as OOM killer and MCE error logs, don't 
go into the DAEMON.LOG file.
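
For example, something along these lines usually surfaces those (adjust 
the date to when the service died):

    dmesg -T | egrep -i 'out of memory|killed process|mce|hardware error'
    journalctl -k --since '2022-08-03'

A hard JVM crash would also leave an hs_err_pid*.log file behind, 
typically in Cassandra's working directory or log directory.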



On 04/08/2022 11:00, Marc Hoppins wrote:

Hulloa all,

Service on two nodes stopped yesterday and I can find nothing to indicate why.  
I have checked Cassandra system.logs, gc.logs and debug.logs as well as OS logs 
and all I can see is the following - which is far from helpful:

DAEMON.LOG
Aug  3 11:39:12 cassandra19 systemd[1]: cassandra.service: Main process exited, 
code=exited, status=1/FAILURE
Aug  3 11:39:12 cassandra19 systemd[1]: cassandra.service: Failed with result 
'exit-code'.

Aug  3 13:44:52 cassandra23 systemd[1]: cassandra.service: Main process exited, 
code=exited, status=1/FAILURE
Aug  3 13:44:52 cassandra23 systemd[1]: cassandra.service: Failed with result 
'exit-code'.

Initially I thought that the reason the second node went down was because it 
had problems communicating with the other stopped node but with a gap of 2 
hours it seems unlikely.  If this occurs on any of these two nodes again I will 
probably increase logging level but to do so for every node in the hope that I 
pick something up is impractical.

In the meantime, is there anything else I can look at which may deliver unto us 
more info?

Marc

Re: Wrong Consistency level seems to be used

2022-07-21 Thread Bowen Song via user
It doesn't make any sense to see consistency level ALL if the code is 
not explicitly using it. My best guess is that somewhere in the code the 
consistency level is overridden.
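
For example, any of these (hypothetical code, DataStax Java driver 3.x as 
in your snippet) would silently override the QUORUM default for that one 
query:

    Statement stmt = new SimpleStatement("SELECT * FROM my_ks.my_table WHERE id = 1");
    stmt.setConsistencyLevel(ConsistencyLevel.ALL);   // per-statement setting wins over QueryOptions
    session.execute(stmt);

    // or a prepared statement carrying its own level
    PreparedStatement ps = session.prepare("SELECT * FROM my_ks.my_table WHERE id = ?");
    ps.setConsistencyLevel(ConsistencyLevel.ALL);

So it may be worth grepping the whole codebase (including any shared 
libraries you own) for setConsistencyLevel, and checking whether a second 
Cluster/Session is ever built without your QueryOptions.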


On 21/07/2022 14:52, pwozniak wrote:


Hi,

we have the following code (java driver):

cluster = Cluster.builder().addContactPoints(contactPoints).withPort(port)
        .withProtocolVersion(ProtocolVersion.V3)
        .withQueryOptions(new QueryOptions()
                .setConsistencyLevel(ConsistencyLevel.QUORUM))
        .withTimestampGenerator(new AtomicMonotonicTimestampGenerator())
        .withCredentials(userName, password).build();

session = cluster.connect(keyspaceName);

where ConsistencyLevel.QUORUM is our default consistency level. But we 
keep receiving the following exceptions:



com.datastax.driver.core.exceptions.ReadTimeoutException: Cassandra 
timeout during read query at consistency ALL (3 responses were 
required but only 2 replica responded)



Why the consistency level is ALL in there? Availability of our cluster 
is reduced because of that. We verified all our source code and 
haven't found places where ALL is set.

We also did heap dump and found only ConsistencyLevel.QUORUM there.


Regards,

Pawel


Re: Adding nodes

2022-07-20 Thread Bowen Song via user
To unsubscribe, please send an email to 
user-unsubscr...@cassandra.apache.org



On 20/07/2022 18:34, emmanuel warreng wrote:

Unsubscribe

On Thu, Jul 7, 2022, 16:49 Marc Hoppins  wrote:

Hi all,

Cluster of 2 DC and 24 nodes

DC1 (RF3) = 12 nodes, 16 tokens each
DC2 (RF3) = 12 nodes, 16 tokens each

Adding 12 more nodes to DC1: I installed Cassandra (version is the
same across all nodes) but, after the first node added, I couldn't
seem to add any further nodes.

I check nodetool status and the newly added node is UJ. It remains
this way all day and only 86Gb of data is added to the node over
the entire day (probably not yet complete).  This seems a little
slow and, more than a little inconvenient to only be able to add
one node at a time - or at least one node every 2 minutes.  When
the cluster was created, I timed each node from service start to
status UJ (having a UUID) and it was around 120 seconds.  Of
course there was no data.

Is it possible I have some setting not correctly tuned?

Thanks

Marc


Re: Adding nodes

2022-07-12 Thread Bowen Song via user
You have some (many?) misunderstandings of how Cassandra works, and 
therefore many of your questions are hard to answer without educating 
you first and making you ask different but related and relevant 
questions instead. That's why you aren't getting any answer from us. We 
are not paid to do that, nor do we have that much free time to teach you 
about the fundamentals of Cassandra.


For instance, nobody has ever suggested that multiple racks in a DC is a 
requirement. Both Jeff and I kept telling you that it's a trade-off 
between consistency and availability in the CAP theorem. But somehow, 
you convinced yourself that it's a requirement.


You really should systematically learn about Cassandra before planning 
to use it in production, the same way you would systematically learn 
about an airplane before trying to fly one. Once you are in the air, some 
mistakes are very hard if not impossible to fix.



On 12/07/2022 16:00, Marc Hoppins wrote:


I posted system log data, GC log data, debug log data, nodetool data.  
I believe I had described the situation more than adequately. 
Yesterday, I was asking what I assumed to be reasonable questions 
regarding the method for adding new nodes to a new rack.


Forgive me if it sounds unreasonable but I asked the same question 
again: your response regarding replication suggests that multiple 
racks in a datacentre is ALWAYS going to be the case when setting up a 
Cassandra cluster. Therefore, I can only assume that when setting up a 
new cluster there absolutely MUST be more than one rack.  The question 
I was asking yesterday regarding adding a new nodes in a new rack has 
never been adequately answered here and the only information I can 
find elsewhere clearly states that it is not recommended to add more 
than one new node at a time to maintain data/token consistency.


So how is it possible to add new hardware when one-at-a-time will 
absolutely overload the first node added?  That seems like a 
reasonable, general question which anyone considering employing the 
software is going to ask.


The reply to suggest that folk head off a pay for a course when there 
are ‘pre-sales’ questions is not a practical response as any business 
is unlikely to be spending speculative money.


*From:*Jeff Jirsa 
*Sent:* Tuesday, July 12, 2022 4:43 PM
*To:* cassandra 
*Cc:* Bowen Song 
*Subject:* Re: Adding nodes

EXTERNAL

On Tue, Jul 12, 2022 at 7:27 AM Marc Hoppins  
wrote:


I was asking the questions but no one cared to answer.

This is probably a combination of "it is really hard to answer a 
question with insufficient data" and your tone. Nobody here gets paid 
to help you solve your company's problems except you.


Re: Adding nodes

2022-07-12 Thread Bowen Song via user
I think you are misinterpreting many concepts here. For starters, a 
physical rack in a physical DC is not (does not have to be) a logical 
rack in a logical DC in Cassandra; and 
allocate_tokens_for_local_replication_factor has nothing to do with 
setting the replication factor (other than using it as an input), but has 
everything to do with token allocation.


You need to plan the number of logical (not physical) racks per DC: 
either number of racks = 1 and RF = any, or number of racks = RF within 
that DC. It's not impossible to add (or remove) a rack from an existing 
DC, but it's much better to plan ahead.



On 12/07/2022 07:33, Marc Hoppins wrote:


There is likely going to be 2 racks in each DC.

Adding the new node decided to quit after 12 hours.  Node was 
overloaded and GC pauses caused the bootstrap to fail.  I begin to see 
the pattern here.  If replication is only within the same datacentre, 
and one starts off with only one rack then all data is within that 
rack, adding a new rack…but can only add one node at a time…will cause 
a surge of replication onto the one new node as this is now a failover 
point.  I noticed when checking netstats on the joining node that it 
was getting data from 12 sources. This lead me to the conclusion that 
ALL the streaming data was coming from every node in the same 
datacentre. I checked this by running netstats on other nodes in the 
second datacentre and they were all quiescent.  So, unlike HBASE/HDFS 
where we can spread the replication across sites, it seems that it is 
not a thing for this software.  Or do I have that wrong?


Now, obviously, this is the second successive failure with adding a 
new node. ALL of the new nodes I need to add are in a new rack.


# Replica factor is explicitly set, regardless of keyspace or datacenter.

# This is the replica factor within the datacenter, like NTS.

allocate_tokens_for_local_replication_factor: 3

If this is going to happen every time I try to add a new node this is 
going to be an untenable situation.  Now, I am informed that the data 
in the cluster is not yet production, so it may be possible to wipe 
everything and start again, adding the new rack of nodes at create 
time. HOWEVER, this is then going to resurface when the next rack of 
nodes is added.  If the recommendation is to only add one node at a 
time to prevent problems with token ranges, data  or whatever, it is a 
serious limitation as not every business/organisation is going to have 
multiple racks available.


*From:*Bowen Song via user 
*Sent:* Monday, July 11, 2022 8:57 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Adding nodes

EXTERNAL

I've noticed the joining node has a different rack than the rest of 
the nodes; is this intended? Will you add all new nodes to this rack 
and have RF=2 in that DC?


In principle, you should have an equal number of servers (vnodes) in each 
rack, and have the number of racks = RF or 1.


On 11/07/2022 13:15, Marc Hoppins wrote:

All clocks are fine.

Why would time sync affect whether or not a node appears in
the nodetool status output when running the command on a different node? 
Either the node is up and visible or not.

From 24 other nodes (including ba-freddy14 itself), it shows in
the status.

For those other 23 nodes AND from the joining node, the one node
which does not show the joining node (ba-freddy03) , is also
visible to all other nodes when running nodetool.

A sample set of nodetool output follows. If you look at the last
status for freddy03 you will see that the joining node
(ba-freddy14) does not appear, but when I started the join, and
for the following 20-25 minutes, it DID appear in the status.  So
I was just asking if anyone else had experienced this behaviour.

(JOINING NODE) ba-freddy14:nodetool status -r

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address    Load    Tokens Owns 
Host ID   Rack

UN  ba-freddy09 591.78 GiB  16  ?
9f7cdc62-2d5c-4d6e-be99-86c577131be5  SSW09

UJ  ba-freddy14 117.37 GiB  16  ?
bf85305e-256f-4eb9-9f15-5462f3b369b9  SSW05

UN  ba-freddy06 614.26 GiB  16  ?
30d85b23-c66c-4781-86e9-960375caf476  SSW09

UN  ba-freddy02 329.26 GiB  16  ?
3388ca94-5db5-4ef6-b7ab-e6fd0485ba49  SSW09

UN  ba-freddy12 584.57 GiB  16  ?
80239a34-89cb-459b-a30f-4253bc16ed99  SSW09

UN  ba-freddy07 563.51 GiB  16  ?
4de96ef6-bd48-4b16-bee1-05a0a6c9ac72  SSW09

UN  ba-freddy01 578.5 GiB   16  ?
86a84980-2f8f-4d23-9099-d4b48ad9d04c  SSW09

UN  ba-freddy05 575.33 GiB  16  ?
26c03d1b-9022-4e1c-bab4-d0d71bddf645  SSW09

UN  ba-freddy10 581.16 GiB  16  ?
7c4051a5-1c77-4713-aa43-561063cedb3a  SSW09

UN  ba-freddy08 605.92 GiB  16  ?
63fe46d1-c521-4df8-b1bb-ba0136168561  SSW09

UN  ba-freddy04 585.65 GiB  16  ?
4503f80a

Re: Adding nodes

2022-07-11 Thread Bowen Song via user
  SSW09


UN ba-freddy03   569.22 GiB  16  ? 
955f21a8-9bc8-4cef-b875-aa4cf7d3294c  SSW09


Datacenter: DR1

===

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address    Load    Tokens  Owns Host 
ID   Rack


UN dr1-freddy12  453.6 GiB   16  ? 
533bb049-c8c9-41d9-8da6-64bdeeb6945d  SSW02


UN dr1-freddy08  449.3 GiB   16  ? 
6e8c42d2-0f6d-4203-9bf7-5c5fe5e17093  SSW02


UN dr1-freddy07  450.42 GiB  16  ? 
4c14b75a-74e8-4518-9c22-053b3a1ad991  SSW02


UN dr1-freddy02  454.02 GiB  16  ? 
e68298d7-e5eb-421f-a586-d5ee3c026627  SSW02


UN dr1-freddy10  453.45 GiB  16  ? 
998bc6cb-7412-411a-89a6-ef5689d61a4a  SSW02


UN dr1-freddy05  463.36 GiB  16  ? 
07876bd9-5374-4df8-a480-168b4c06f9f1  SSW02


UN dr1-freddy11  453.01 GiB  16  ? 
38fca1c2-59da-4181-93a6-979b937b3fd9  SSW02


UN dr1-freddy03  460.55 GiB  16  ? 
a1ab1b4b-ccdc-4cb2-ad59-e9e67f0ddfbb  SSW02


UN dr1-freddy04  463.19 GiB  16      ? 
29ee0eff-010d-4fbb-b204-095de2225031  SSW02


UN dr1-freddy06  454.5 GiB   16  ? 
51467fd3-b795-4ba1-8eec-58b1030cb9c5  SSW02


UN dr1-freddy09  446.3 GiB   16  ? 
b071e232-b275-4ce7-809c-7c8fe546fbb4  SSW02


UN dr1-freddy01  450.86 GiB  16  ? 
c2340595-c3ec-440c-b978-62f62fd98a9a  SSW02


ba-freddy03: nodetool status -r

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address    Load    Tokens  Owns Host 
ID   Rack


UN ba-freddy09   592.23 GiB  16  ? 
9f7cdc62-2d5c-4d6e-be99-86c577131be5  SSW09


UN ba-freddy06   614.63 GiB  16  ? 
30d85b23-c66c-4781-86e9-960375caf476  SSW09


UN ba-freddy02   329.66 GiB  16  ? 
3388ca94-5db5-4ef6-b7ab-e6fd0485ba49  SSW09


UN ba-freddy12   584.97 GiB  16  ? 
80239a34-89cb-459b-a30f-4253bc16ed99  SSW09


UN ba-freddy07   563.91 GiB  16  ? 
4de96ef6-bd48-4b16-bee1-05a0a6c9ac72  SSW09


UN ba-freddy01   578.83 GiB  16  ? 
86a84980-2f8f-4d23-9099-d4b48ad9d04c  SSW09


UN ba-freddy05   575.69 GiB  16  ? 
26c03d1b-9022-4e1c-bab4-d0d71bddf645  SSW09


UN ba-freddy10   581.56 GiB  16  ? 
7c4051a5-1c77-4713-aa43-561063cedb3a  SSW09


UN ba-freddy08   606.27 GiB  16  ? 
63fe46d1-c521-4df8-b1bb-ba0136168561  SSW09


UN ba-freddy04   586.05 GiB  16  ? 
4503f80a-2890-4a3f-b0cb-d3cedc2b51d2  SSW09


UN ba-freddy11   576.86 GiB  16  ? 
b5b368fb-ebe3-4eed-a2a1-404b07ae2b6c  SSW09


UN ba-freddy03   569.32 GiB  16  ? 
955f21a8-9bc8-4cef-b875-aa4cf7d3294c  SSW09


Datacenter: DR1

===

Status=Up/Down

|/ State=Normal/Leaving/Joining/Moving

-- Address    Load    Tokens  Owns Host 
ID   Rack


UN dr1-freddy12  453.68 GiB  16  ? 
533bb049-c8c9-41d9-8da6-64bdeeb6945d  SSW02


UN dr1-freddy08  449.39 GiB  16  ? 
6e8c42d2-0f6d-4203-9bf7-5c5fe5e17093  SSW02


UN dr1-freddy07  450.51 GiB  16  ? 
4c14b75a-74e8-4518-9c22-053b3a1ad991  SSW02


UN dr1-freddy02  454.11 GiB  16  ? 
e68298d7-e5eb-421f-a586-d5ee3c026627  SSW02


UN dr1-freddy10  453.54 GiB  16  ? 
998bc6cb-7412-411a-89a6-ef5689d61a4a  SSW02


UN dr1-freddy05  463.44 GiB  16  ? 
07876bd9-5374-4df8-a480-168b4c06f9f1  SSW02


UN dr1-freddy11  453.1 GiB   16  ? 
38fca1c2-59da-4181-93a6-979b937b3fd9  SSW02


UN dr1-freddy03  460.62 GiB  16  ? 
a1ab1b4b-ccdc-4cb2-ad59-e9e67f0ddfbb  SSW02


UN dr1-freddy04  463.27 GiB  16  ? 
29ee0eff-010d-4fbb-b204-095de2225031  SSW02


UN dr1-freddy06  454.57 GiB  16  ? 
51467fd3-b795-4ba1-8eec-58b1030cb9c5  SSW02


UN dr1-freddy09  446.39 GiB  16  ? 
b071e232-b275-4ce7-809c-7c8fe546fbb4  SSW02


UN dr1-freddy01  450.94 GiB  16  ? 
c2340595-c3ec-440c-b978-62f62fd98a9a  SSW02


*From:*Joe Obernberger 
*Sent:* Monday, July 11, 2022 1:29 PM
*To:* user@cassandra.apache.org
*Subject:* Re: Adding nodes


I too came from HBase and discovered adding several nodes at a time 
doesn't work.  Are you absolutely sure that the clocks are in sync 
across the nodes?  This has bitten me several times.
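
A rough clock comparison across the nodes can be done with a sketch 
like this (assuming passwordless SSH and a hosts.txt with one hostname 
per line; the SSH round trip adds a little noise, so only differences 
of a second or more are meaningful here):

    # Print each node's current time.
    # -n stops ssh from swallowing the rest of hosts.txt.
    while read -r h; do
        printf '%s %s\n' "$h" "$(ssh -n "$h" date +%s.%N)"
    done < hosts.txt
    # Where chrony is in use, 'chronyc tracking' on each node also reports
    # the measured offset from its time sources.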


-Joe

On 7/11/2022 6:23 AM, Bowen Song via user wrote:

You should look for warning and error level logs in the
system.log, not the debug.log or gc.log, and certainly not only
the latest lines.

BTW, you may want to spend some time investigating potential GC
issues based on the GC logs you provided. I can see 1 full GC in
the 3 hours since the node started. It's not necessarily a problem
    (if it only occasionally happens during the initial bootstrapping
process), but it should justify an investigation if this is the
first time you've seen it.

On 11/07/2022 11:09, Marc Hoppins wrote:

Service still running. No errors showing.

The latest info is in debug.log

DEBUG [Streaming-EventLoop-4-3] 2022-07-11 12:00:38,902
NettyStreamingMessageSender.java:258 - [Stream
#befbc5d0-00e7-11ed-860a-a139feb6a78a

Re: Adding nodes

2022-07-11 Thread Bowen Song via user
You should look for warning and error level logs in the system.log, not 
the debug.log or gc.log, and certainly not only the latest lines.


BTW, you may want to spend some time investigating potential GC issues 
based on the GC logs you provided. I can see 1 full GC in the 3 hours 
since the node started. It's not necessarily a problem (if it only 
occasionally happens during the initial bootstrapping process), but it 
should justify an investigation if this is the first time you've seen it.
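
If it helps, the cumulative full-GC count is already recorded in the 
heap summaries of the GC log you posted (the "(full N)" figure); a 
one-liner to pull the latest value, assuming the same log file name 
and the usual log directory, would be:

    # Print the most recent cumulative full GC count from the CMS-style log.
    grep -o 'full [0-9]*' /var/log/cassandra/gc.log.1.current | tail -n 1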


On 11/07/2022 11:09, Marc Hoppins wrote:


Service still running. No errors showing.

The latest info is in debug.log

DEBUG [Streaming-EventLoop-4-3] 2022-07-11 12:00:38,902 
NettyStreamingMessageSender.java:258 - [Stream 
#befbc5d0-00e7-11ed-860a-a139feb6a78a channel: 053f2911] Sending 
keep-alive


DEBUG [Stream-Deserializer-/10.1.146.174:7000-053f2911] 2022-07-11 
12:00:39,790 StreamingInboundHandler.java:179 - [Stream 
#befbc5d0-00e7-11ed-860a-a139feb6a78a channel: 053f2911] Received 
keep-alive


DEBUG [ScheduledTasks:1] 2022-07-11 12:00:44,688 
StorageService.java:2398 - Ignoring application state LOAD from 
/x.x.x.64:7000 because it is not a member in token metadata


DEBUG [ScheduledTasks:1] 2022-07-11 12:01:44,689 
StorageService.java:2398 - Ignoring application state LOAD from 
/x.x.x.64:7000 because it is not a member in token metadata


DEBUG [ScheduledTasks:1] 2022-07-11 12:02:44,690 
StorageService.java:2398 - Ignoring application state LOAD from 
/x.x.x.64:7000 because it is not a member in token metadata


And

gc.log.1.current

2022-07-11T12:08:40.562+0200: 11122.837: [GC (Allocation Failure) 
2022-07-11T12:08:40.562+0200: 11122.838: [ParNew


Desired survivor size 41943040 bytes, new threshold 1 (max 1)

- age   1:  57264 bytes,  57264 total

: 655440K->74K(737280K), 0.0289143 secs] 2575800K->1920436K(8128512K), 
0.0291355 secs] [Times: user=0.23 sys=0.00, real=0.03 secs]


Heap after GC invocations=6532 (full 1):

par new generation   total 737280K, used 74K [0x0005cae0, 
0x0005fce0, 0x0005fce0)


eden space 655360K,   0% used [0x0005cae0, 0x0005cae0, 
0x0005f2e0)


from space 81920K,   0% used [0x0005f2e0, 0x0005f2e12848, 
0x0005f7e0)


to   space 81920K,   0% used [0x0005f7e0, 0x0005f7e0, 
0x0005fce0)


concurrent mark-sweep generation total 7391232K, used 1920362K 
[0x0005fce0, 0x0007c000, 0x0007c000)


Metaspace used 53255K, capacity 56387K, committed 56416K, reserved 
1097728K


class space    used 6926K, capacity 7550K, committed 7576K, reserved 
1048576K


}

2022-07-11T12:08:40.591+0200: 11122.867: Total time for which 
application threads were stopped: 0.0309913 seconds, Stopping threads 
took: 0.0012599 seconds


{Heap before GC invocations=6532 (full 1):

par new generation   total 737280K, used 655434K [0x0005cae0, 
0x0005fce0, 0x0005fce0)


eden space 655360K, 100% used [0x0005cae0, 0x0005f2e0, 
0x0005f2e0)


from space 81920K,   0% used [0x0005f2e0, 0x0005f2e12848, 
0x0005f7e0)


to   space 81920K,   0% used [0x0005f7e0, 0x0005f7e0, 
0x0005fce0)


concurrent mark-sweep generation total 7391232K, used 1920362K 
[0x0005fce0, 0x0007c000, 0x0007c000)


Metaspace   used 53255K, capacity 56387K, committed 56416K, 
reserved 1097728K


class space    used 6926K, capacity 7550K, committed 7576K, reserved 
1048576K


2022-07-11T12:08:42.163+0200: 11124.438: [GC (Allocation Failure) 
2022-07-11T12:08:42.163+0200: 11124.438: [ParNew


Desired survivor size 41943040 bytes, new threshold 1 (max 1)

- age   1:  54984 bytes,  54984 total

: 655434K->80K(737280K), 0.0291754 secs] 2575796K->1920445K(8128512K), 
0.0293884 secs] [Times: user=0.22 sys=0.00, real=0.03 secs]


*From:*Bowen Song via user 
*Sent:* Monday, July 11, 2022 11:56 AM
*To:* user@cassandra.apache.org
*Subject:* Re: Adding nodes


Checking on multiple nodes won't help if the joining node suffers from 
any of the issues I described, as it will likely be flipping up and 
down frequently, and the existing nodes in the cluster may never reach 
an agreement before the joining node stays up (or stays down) for a 
while. However, it would be very strange if this were a persistent 
behaviour. If the 'nodetool status' output on each node remained 
unchanged for hours and the outputs still aren't the same between 
nodes, it could be an indicator of something else having gone wrong.
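
A quick way to compare the views is to hash each node's membership 
list and check that the hashes all match. A sketch only, assuming 
passwordless SSH and a hosts.txt with one hostname per line:

    # Hash the (state, address) pairs each node reports; identical hashes
    # mean the nodes agree on cluster membership and state.
    # -n stops ssh from swallowing the rest of hosts.txt.
    while read -r h; do
        printf '%s  %s\n' "$h" \
            "$(ssh -n "$h" nodetool status | awk '/^[UD][NJLM]/ {print $1, $2}' | sort | md5sum)"
    done < hosts.txt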


Does the strange behaviour go away after the joining node completes 
the streaming and fully joins the cluster?


On 11/07/2022 10:46, Marc Hoppins wrote:

I am beginning to wonder…

If you recall, I stated that I had checked status on a bunch of
other nodes from both datacentres and the joining node shows up.
No errors are occurring anywhere; data is streaming; node is

Re: Adding nodes

2022-07-11 Thread Bowen Song via user
How long does it take to add a new node? I'm 100% sure neither 90s nor 
120s is the answer. The answer is: it varies. If you want to wait for a 
new node to finish being added, be explicit about it and wait until the 
node has fully joined the cluster. Don't put a fixed number of seconds 
in there.


You can estimate the time for adding many nodes once you've added one 
node to the cluster. The time depends not only on the data size, 
hardware and network, but also on the data in the SSTable files. For 
example, if a full copy of a very large partition exists in many 
SSTable files but the latest one of them is a tombstone, then the 
actual data that gets streamed is only the tombstone, not the other 
copies of the large data.


BTW, for your own sake, you should consider automating the process to 
minimise the human interaction required to add multiple nodes. It may 
be manageable when you have 5 or 10 nodes to add, but it will quickly 
spin out of control when you have tens or a few hundred of them.
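
A minimal sketch of that kind of wait (the IP address and the polling 
interval below are placeholders, not values from your cluster; run it 
from a node that is already in the ring):

    # Block until the joining node shows up as UN (Up/Normal) in nodetool status.
    NEW_NODE_IP=10.0.0.99        # hypothetical address of the node being added
    until nodetool status | grep "$NEW_NODE_IP" | grep -q '^UN'; do
        echo "$(date) - $NEW_NODE_IP is still joining..."
        sleep 300
    done
    echo "$NEW_NODE_IP has fully joined; safe to start the next node."

The same check can sit inside whatever automation (Ansible, a shell 
script, etc.) starts the Cassandra service on each new node, in place 
of any fixed pause.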


On 11/07/2022 10:41, Marc Hoppins wrote:


“Where did you come up with the 90 seconds number?” The database folk 
came up with THAT number. For myself, I timed adding a new node at 120 
seconds for the initial setup with no data in the cluster.


“What exactly are you waiting for by doing that?” I wanted to see for 
myself how long it took to add a new node.  Isn’t that what RESEARCH 
is all about?  I suppose I could have just ‘googled’ it.


“Since adding nodes doesn't interfere with the client queries, the 
time it takes to add a node shouldn't be a concern at all…” It IS a 
concern if one has to add many nodes and the ‘customers’ want some 
idea of how long the process will take. Or, and I may be alone in 
this, it would be helpful to know when to begin adding the next new 
node in the ticket. Therefore, if I know when my first node is 
finished, I will have an idea of how long before I check for the when 
subsequent nodes can be joined.


*From:*Bowen Song via user 
*Sent:* Monday, July 11, 2022 11:25 AM
*To:* user@cassandra.apache.org
*Subject:* Re: Adding nodes


Sleeping/pausing for a fixed amount of time between operations is at 
best a hack to work around an unknown issue; it's almost always better 
to be explicit about what you are waiting for. Where did you come up 
with the 90 seconds number? What exactly are you waiting for by doing 
that? If you want to wait for the node's state to become Normal (from 
Joining), be explicit about it: check the nodetool output or the 
system.log file periodically instead of waiting a fixed 90 seconds.
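
For the log-based route, something as simple as the following sketch 
works (the log path is the common package default, and the keywords 
are only a guess at what to look for; exact wording varies between 
Cassandra versions):

    # Follow the joining node's own log and surface join/stream related lines.
    tail -f /var/log/cassandra/system.log | grep -iE 'bootstrap|stream|joining|normal'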


Streaming 600GB in a few hours sounds fairly reasonable. Since adding 
nodes doesn't interfere with the client queries, the time it takes to 
add a node shouldn't be a concern at all, as long as it's significantly 
faster than the data growth rate. Just leave it running in the 
background, and get on with your life.
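
As a rough worked example (the streaming rate is assumed, not measured 
from this cluster): at an effective 50 MB/s, 600 GB is about 
600,000 MB / 50 MB/s = 12,000 s, i.e. roughly 3.3 hours, which is 
consistent with "a few hours" per node.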


If you must speed up that process and don't care about data 
inconsistency or potential downtime, there are faster ways to do it, 
but doing that breaks consistency and/or availability, which means it 
will interfere with client read/write operations.


A few hundred GB to a few TB per node is pretty common in Cassandra 
clusters. Big data is not about how much data is on EACH node, it's 
about how much data there is in TOTAL.


On 11/07/2022 09:01, Marc Hoppins wrote:

Well then…

I left this on Friday (still running) and came back to it today
(Monday) to find the service stopped.  So, I blitzed this node
from the ring and began anew with a different new node.

I rather suspect the problem was with trying to use Ansible to add
these initially - despite the fact that I had a serial limit of 1
and a pause of 90s for starting the service on each new node
(based on the time taken when setting up this Cassandra cluster).

So…moving forward…

It is recommended to only add one new node at a time from what I
read.  This leads me to:

Although I see the new node LOAD is progressing far faster than
the previous failure, it is still going to take several hours to
move from UJ to UN, which means I’ll be at this all week for the
12 new nodes. If our LOAD per node is around 400-600GB, is there
any practical method to speed up adding multiple new nodes which
is unlikely to cause problems?  After all, in the modern world of
big (how big is big?) data, 600G per node is far less than the
real BIG big-data.

Marc

*From:*Jeff Jirsa  <mailto:jji...@gmail.com>
*Sent:* Friday, July 8, 2022 5:46 PM
*To:* cassandra 
<mailto:user@cassandra.apache.org>
*Cc:* Bowen Song  <mailto:bo...@bso.ng>
*Subject:* Re: Adding nodes


Having a node UJ but not sending/receiving other streams is an
invalid state (unless 4.0 moved the streaming data out of
netstats? I'm not 100% sure, but I'm 99% sure it should be there).

It likely stopped the bootstrap process long ago with a
