Fwd: Re: Cassandra uneven data repartition

2023-01-06 Thread onmstester onmstester via user
Isn't there a very big (>40GB) sstable in /volumes/cassandra/data/data1? If 
there is, you could split it or change your data model to prevent such sstables from forming.
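If it comes to splitting, here is a rough sketch of the offline split for reference (the sstable name, keyspace/table directory and the 5 GB target size are only examples, not taken from your cluster):

  # sstablesplit is an offline tool, so stop the node first
  sudo systemctl stop cassandra
  # find the oversized Data.db file(s) under the full data directory
  ls -lhS /volumes/cassandra/data/data1/*/*/*-Data.db | head
  # split one of them into ~5 GB pieces (the -s size is given in MB)
  sstablesplit --no-snapshot -s 5120 /volumes/cassandra/data/data1/my_ks/my_table-*/nb-1234-big-Data.db
  sudo systemctl start cassandra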



Sent using https://www.zoho.com/mail/








 Forwarded message 
From: Loïc CHANEL via user 
To: 
Date: Fri, 06 Jan 2023 12:58:11 +0330
Subject: Re: Cassandra uneven data repartition
 Forwarded message 



Hi team,



Does anyone know how to even out the data between several data disks?

Another approach could be to prevent Cassandra from writing to a disk that is 90% full, 
but is there a way to do that?

Thanks,





Loïc CHANEL
System Big Data engineer
SoftAtHome (Lyon, France)

On Mon, 19 Dec 2022 at 11:07, Loïc CHANEL wrote:





Hi team,



I had a disk space issue on a Cassandra server, and I noticed that the data was 
not evenly shared between my 15 disks.

Here is the distribution:

/dev/vde1        99G   89G  4.7G  96% /volumes/cassandra/data/data1
/dev/vdd1        99G   51G   44G  54% /volumes/cassandra/data/data2
/dev/vdf1        99G   57G   38G  61% /volumes/cassandra/data/data3
/dev/vdg1        99G   51G   44G  54% /volumes/cassandra/data/data4
/dev/vdh1        99G   50G   44G  54% /volumes/cassandra/data/data5
/dev/vdi1        99G   50G   44G  53% /volumes/cassandra/data/data6
/dev/vdj1        99G   77G   17G  83% /volumes/cassandra/data/data7
/dev/vdk1        99G   49G   45G  53% /volumes/cassandra/data/data8
/dev/vdl1        99G   52G   42G  56% /volumes/cassandra/data/data9
/dev/vdm1        99G   50G   45G  53% /volumes/cassandra/data/data10
/dev/vdn1        99G   47G   47G  51% /volumes/cassandra/data/data11
/dev/vdo1        99G   50G   44G  54% /volumes/cassandra/data/data12
/dev/vdp1        99G   52G   43G  55% /volumes/cassandra/data/data13
/dev/vdq1        99G   49G   45G  52% /volumes/cassandra/data/data14
/dev/vdr1        99G   50G   44G  53% /volumes/cassandra/data/data15



Do you know what could cause this, and how to even out the data a little between the 
disks to avoid saturating any one of them? I noticed that when a disk gets down to about 
5% free space, Cassandra performance drops.
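For what it's worth, a couple of commands that might show whether a single table or sstable is pinning data1 (the path comes from the df output above, everything else is generic):

  # space used per table directory on the nearly-full disk
  du -sh /volumes/cassandra/data/data1/*/* | sort -rh | head -20
  # any individual sstable over 40 GB on that disk?
  find /volumes/cassandra/data/data1 -name '*-Data.db' -size +40G -exec ls -lh {} \;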

Thanks,





Loïc CHANEL
System Big Data engineer
SoftAtHome (Lyon, France)

RE: Best compaction strategy for rarely used data

2023-01-06 Thread onmstester onmstester via user
Another solution: distribute the data across more tables. For example, you could create 
multiple tables based on the value or hash bucket of one of the columns; the current data 
volume and compaction overhead would then be divided across the underlying tables (a sketch 
follows below). Note, though, that there is a practical limit on the number of tables in 
Cassandra (a few hundred).
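A minimal sketch of the bucketing idea, assuming a made-up keyspace, table and column layout and 4 buckets (none of these names come from the original thread):

  # create one table per bucket
  for b in 0 1 2 3; do
    cqlsh -e "CREATE TABLE IF NOT EXISTS my_ks.events_b${b} (
                pk text, ck int, payload text,
                PRIMARY KEY (pk, ck));"
  done
  # the application then routes each row to events_b<hash(pk) % 4> on both writes and reads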

I wish STCS simply had a maximum sstable size limit so that sstables bigger than the limit 
would not be compacted at all; that would have solved most problems like this.



Sent using https://www.zoho.com/mail/








 On Fri, 30 Dec 2022 21:43:27 +0330 Durity, Sean R via user 
 wrote ---




Yes, clean-up will reduce the disk space on the existing nodes by re-writing 
only the data that the node now owns into new sstables.

 

 

Sean R. Durity

DB Solutions

Staff Systems Engineer – Cassandra

 

From: Lapo Luchini  
 Sent: Friday, December 30, 2022 4:12 AM
 To: mailto:user@cassandra.apache.org
 Subject: [EXTERNAL] Re: Best compaction strategy for rarely used data

 


On 2022-12-29 21:54, Durity, Sean R via user wrote:

> At some point you will end up with large sstables (like 1 TB) that won’t 

> compact because there are not 4 similar-sized ones able to be compacted 

 

Yes, that's exactly what's happening.

 

I'll see maybe just one more compaction, since the biggest sstable is 

already more than 20% of residual free space.

 

> For me, the backup strategy shouldn’t drive the rest.

 

Mhh, yes, that makes sense.

 

> And if your data is ever-growing 

> and never deleted, you will be adding nodes to handle the extra data as 

> time goes by (and running clean-up on the existing nodes).

 

What will happen when adding new nodes, as you say, though?

If I have a 1TB sstable with 250GB of data that will no longer be useful 

(as a new node will be the new owner), will that sstable be reduced to 

750GB by "cleanup", or will it retain the old data?

 

Thanks,

 

-- 

Lapo Luchini

mailto:l...@lapo.it

 

Re: Fwd: Re: Problem on setup Cassandra v4.0.1 cluster

2022-10-08 Thread onmstester onmstester via user
I encountered the same problem again with the same error logs (this time with Apache 
Cassandra 4.0.6 and a new cluster), but unlike the previous time, the hostname config was 
fine. After days of trial and error, I finally found the root cause: the clock on the 
faulty server was 2 minutes off and not in sync with the other servers in the cluster! 
I synced the time and the problem was fixed.

I wonder if the community could log more information for problems like these (to save 
users from having to struggle and debug this sort of thing), because these two problems 
(a faulty hostname config and an unsynchronized server clock) are common with manual 
configuration, and nobody expects them to prevent a Cassandra node from joining the cluster!
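For anyone hitting this, a few checks worth scripting before a node joins the cluster (the commands assume a typical Linux box with systemd and chrony or ntpd; adjust for your distro):

  timedatectl status | grep -Ei 'synchronized|ntp'      # is the clock in sync at all?
  chronyc tracking 2>/dev/null || ntpq -p               # offset from the time source
  hostname; cat /etc/hostname                           # do these agree...
  grep -w "$(hostname)" /etc/hosts                      # ...and does /etc/hosts know the name?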


Sent using https://www.zoho.com/mail/








 On Mon, 31 Jan 2022 16:35:50 +0330 onmstester onmstester 
 wrote ---





Once again it was related to hostname configuration (I remember having had problems 
with this multiple times before, even with different applications); this time the 
root cause was a typo in one of the multiple config files for the hostname (a different 
name in /etc/hostname than in /etc/hosts)! I fixed that and now there is no 
problem.



I wonder how Cassandra-3.11 worked?!



P.S.: The default DC name in version 4 was changed to datacenter1 (from dc1), and it 
seems to cause a bit of trouble with previous configs (the default in the rack-dc 
conf is still dc1).



Thank you



Best Regards

Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Erick Ramirez <mailto:erick.rami...@datastax.com>
To: <mailto:user@cassandra.apache.org>
Date: Mon, 31 Jan 2022 15:06:21 +0330
Subject: Re: Problem on setup Cassandra v4.0.1 cluster
 Forwarded message 












TP stats indicate pending gossip. Check that the times are synchronised on both 
nodes (use NTP) since it can prevent gossip from working.



I'd also suggest looking at the logs on both nodes to see what other WARN and 
ERROR messages are being reported. Cheers!

Re: Using zstd compression on Cassandra 3.x

2022-09-13 Thread onmstester onmstester via user
I patched this on 3.11.2 easily: 

1. build the jar file from source and put it in the cassandra/lib directory

2. restart the cassandra service

3. alter the table to use zstd compression and rebuild its sstables



But that was at a time when 4.0 was not yet available, and I upgraded to 4.0 immediately 
after it was.
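For reference, on 4.0 the same change needs no patch; a sketch with placeholder keyspace/table names:

  cqlsh -e "ALTER TABLE my_ks.my_table
            WITH compression = {'class': 'ZstdCompressor',
                                'compression_level': '1',
                                'chunk_length_in_kb': '64'};"
  # rewrite the existing sstables so they pick up the new compressor
  nodetool upgradesstables -a my_ks my_table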


Sent using https://www.zoho.com/mail/








 On Tue, 13 Sep 2022 06:38:08 +0430 Eunsu Kim  
wrote ---



Hi all, 
 
zstd is a very good compression algorithm (the overall performance and compression ratio 
are excellent), and it is available in Cassandra 4.0. 
 
There is open source available for Cassandra 3.x. 
https://github.com/MatejTymes/cassandra-zstd 
 
Do you have any experience applying this to production? 
 
I want to improve performance and disk usage by applying it to a running 
Cassandra cluster. 
 
Thanks.

Re: Compaction task priority

2022-09-06 Thread onmstester onmstester via user
Use nodetool stop -id COMPACTION_UUID (the UUID is reported in compactionstats); you can 
also figure out the details with nodetool help stop.
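In other words (the UUID below is just a placeholder; take the real one from the id column):

  nodetool compactionstats          # first column ("id") is the task UUID
  nodetool stop -id 8f5a2610-1111-2222-3333-444444444444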


Sent using https://www.zoho.com/mail/








 On Mon, 05 Sep 2022 10:18:52 +0430 Gil Ganz  wrote ---




onmstester - how can you stop a specific compaction task? The stop command stops 
all compactions of a given type (it would be nice to be able to stop a specific one).

Jim - in my case the solution was actually to limit concurrent compactors, not 
increase it. Too many tasks caused the server to slow down and not be able to 
keep up.




On Fri, Sep 2, 2022 at 4:55 PM Jim Shaw <mailto:jxys...@gmail.com> wrote:





If capacity allows, increase compaction_throughput_mb_per_sec as the first tuning step, 
and if compactions still fall behind, increase concurrent_compactors as the second.
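Both can be changed live, without a restart; a sketch (the numbers are only examples, and setconcurrentcompactors is available on recent 3.11/4.x releases):

  nodetool getcompactionthroughput
  nodetool setcompactionthroughput 128      # MB/s, 0 disables throttling
  nodetool setconcurrentcompactors 4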



Regards,



Jim
On Fri, Sep 2, 2022 at 3:05 AM onmstester onmstester via user 
<mailto:user@cassandra.apache.org> wrote:

Another thing that comes to my mind: increase the minimum sstable count to compact 
(min_threshold) from 4 to 32 for the big table that won't be read that much (example 
below), although you should watch out for the sstable count growing too high.
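As a sketch, with placeholder keyspace/table names:

  cqlsh -e "ALTER TABLE my_ks.big_cold_table
            WITH compaction = {'class': 'SizeTieredCompactionStrategy',
                               'min_threshold': '32',
                               'max_threshold': '64'};"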



Sent using https://www.zoho.com/mail/








 On Fri, 02 Sep 2022 11:29:59 +0430 onmstester onmstester via user 
<mailto:user@cassandra.apache.org> wrote ---



I was there too, and found no workaround except stopping big/unnecessary compactions 
manually (using nodetool stop) whenever they appear, via some shell scripts run from 
crontab (a sketch of such a script follows).
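Roughly what such a cron job can look like; the column positions assume the compactionstats layout of recent 3.x/4.x builds, so check the output of your own version before relying on it:

  #!/bin/sh
  # stop any compaction currently running against one big, rarely-read table
  TABLE="big_cold_table"        # placeholder name
  nodetool compactionstats | awk -v t="$TABLE" '$4 == t { print $1 }' |
  while read -r id; do
      nodetool stop -id "$id"
  done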



Sent using https://www.zoho.com/mail/








 On Fri, 02 Sep 2022 10:59:22 +0430 Gil Ganz <mailto:gilg...@gmail.com> 
wrote ---











Hey, when deciding which sstables to compact together, how is the priority 
determined between tasks, and can I do something about it?



In some cases (mostly after removing a node), it takes a while for compactions to catch 
up with the data that came from the removed node. I see the node is busy with huge 
compaction tasks, but in the meantime a lot of small sstables are piling up (new data 
coming from the application), so read performance is not good: the new data is scattered 
across many sstables, and combining big sstables probably won't reduce that fragmentation 
much (I think).



Another thing that comes to mind: I have a table that is very big but not read that much; 
it would be nice to give other tables higher compaction priority (to help in a case like 
the one I described above).



Version is 4.0.4



Gil

Re: Compaction task priority

2022-09-02 Thread onmstester onmstester via user
Another thing that comes to my mind: increase the minimum sstable count to compact 
(min_threshold) from 4 to 32 for the big table that won't be read that much, although 
you should watch out for the sstable count growing too high.


Sent using https://www.zoho.com/mail/








 On Fri, 02 Sep 2022 11:29:59 +0430 onmstester onmstester via user 
 wrote ---



I was there too, and found no workaround except stopping big/unnecessary compactions 
manually (using nodetool stop) whenever they appear, via some shell scripts run from 
crontab.



Sent using https://www.zoho.com/mail/








 On Fri, 02 Sep 2022 10:59:22 +0430 Gil Ganz <mailto:gilg...@gmail.com> 
wrote ---











Hey, when deciding which sstables to compact together, how is the priority 
determined between tasks, and can I do something about it?



In some cases (mostly after removing a node), it takes a while for compactions to catch 
up with the data that came from the removed node. I see the node is busy with huge 
compaction tasks, but in the meantime a lot of small sstables are piling up (new data 
coming from the application), so read performance is not good: the new data is scattered 
across many sstables, and combining big sstables probably won't reduce that fragmentation 
much (I think).



Another thing that comes to mind: I have a table that is very big but not read that much; 
it would be nice to give other tables higher compaction priority (to help in a case like 
the one I described above).



Version is 4.0.4



Gil

Re: Compaction task priority

2022-09-02 Thread onmstester onmstester via user
I was there too, and found no workaround except stopping big/unnecessary compactions 
manually (using nodetool stop) whenever they appear, via some shell scripts run from 
crontab.


Sent using https://www.zoho.com/mail/








 On Fri, 02 Sep 2022 10:59:22 +0430 Gil Ganz  wrote ---



Hey, when deciding which sstables to compact together, how is the priority 
determined between tasks, and can I do something about it?



In some cases (mostly after removing a node), it takes a while for compactions to catch 
up with the data that came from the removed node. I see the node is busy with huge 
compaction tasks, but in the meantime a lot of small sstables are piling up (new data 
coming from the application), so read performance is not good: the new data is scattered 
across many sstables, and combining big sstables probably won't reduce that fragmentation 
much (I think).



Another thing that comes to mind: I have a table that is very big but not read that much; 
it would be nice to give other tables higher compaction priority (to help in a case like 
the one I described above).



Version is 4.0.4



Gil

Re: slow compactions

2022-03-06 Thread onmstester onmstester
Forgot to mention that i'm using default STCS for all tables





 On Sun, 06 Mar 2022 12:29:52 +0330 onmstester onmstester 
 wrote 



Hi, 

Sometimes compactions get very slow (a few KB per second for each compaction) on a few 
nodes, which is fixed temporarily by restarting Cassandra (although the problem comes 
back a few hours later).

I copied the sstables involved in the slow compactions to an isolated single-node 
Cassandra, and there they compact fast.

I am using a Set as a non-key column.

I suspect one table with big partitions (1000 rows per partition; each row has a Set, 
and the sum of the set sizes across all rows in a big partition is 10M), but disabling 
autocompaction for this table didn't help.

There is no obvious hardware resource shortage; resource consumption is identical across 
all nodes in the cluster, but only a few nodes have slow compactions.



I'm using Apache Cassandra 3.11.2. A few related configs:



concurrent compactors: 5

throughput: 64MB

tables count: 20 + default tables

Disk: 7.2K

CPU: 12 cores with 50% usage



Has anyone experienced similar problems with compactions? Is there a related bug that 
would be fixed by upgrading? How can I find the root cause of this problem?



Best Regards

slow compactions

2022-03-06 Thread onmstester onmstester
Hi, 

Sometimes compactions get very slow (a few KB per second for each compaction) on a few 
nodes, which is fixed temporarily by restarting Cassandra (although the problem comes 
back a few hours later).

I copied the sstables involved in the slow compactions to an isolated single-node 
Cassandra, and there they compact fast.

I am using a Set as a non-key column.

I suspect one table with big partitions (1000 rows per partition; each row has a Set, 
and the sum of the set sizes across all rows in a big partition is 10M), but disabling 
autocompaction for this table didn't help.

There is no obvious hardware resource shortage; resource consumption is identical across 
all nodes in the cluster, but only a few nodes have slow compactions.



I'm using Apache Cassandra 3.11.2. A few related configs:



concurrent compactors: 5

throughput: 64MB

tables count: 20 + default tables

Disk: 7.2K

CPU: 12 cores with 50% usage



Has anyone experienced similar problems with compactions? Is there a related bug that 
would be fixed by upgrading? How can I find the root cause of this problem?
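For reference, a few things worth capturing while a compaction is crawling (exact flags and output columns may differ slightly between versions):

  nodetool compactionstats -H           # which task, progress, bytes remaining
  nodetool getcompactionthroughput      # is the 64 MB/s throttle what is actually in effect?
  iostat -x 5 3                         # per-disk utilisation / await on the 7.2K drives
  nodetool tpstats | grep -i compaction # pending/blocked CompactionExecutor tasks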



Best Regards

Re: TLS/SSL overhead

2022-02-07 Thread onmstester onmstester
Thank you,



I should have mentioned the hardware and software I used in this experiment:

CPU: one Intel Xeon silver 4210 10 core 2.2G

Network: 1Gb

OS: Ubuntu 20.04.2 LTS

Java: 1.8.0_321 Oracle

Apache Cassandra 4.0.1


Data model is a single table:



 text partitionKey,   15chars

 int clusterKey,        8 digits

 text simpleColumn  1200 chars

key: (partitionkey, clusterKey)




The generated keys and the Cassandra SSL config are the same as in this dzone article: 
https://dzone.com/articles/setting-up-a-cassandra-cluster-with-ssl#

server_encryption_options:
    internode_encryption: all
    keystore: /opt/cassandra/conf/certs/cassandra.keystore
    keystore_password: cassandra
    truststore: /opt/cassandra/conf/certs/cassandra.truststore
    truststore_password: cassandra
    # More advanced defaults below:
    protocol: TLS

client_encryption_options:
    enabled: true
    # If enabled and optional is set to true encrypted and unencrypted connections are handled.
    optional: false
    keystore: /opt/cassandra/conf/certs/cassandra.keystore
    keystore_password: cassandra
    truststore: /opt/cassandra/conf/certs/cassandra.truststore
    truststore_password: cassandra
    require_client_auth: true
    protocol: TLS




Cassandra Configs other than default:

Max Heap: 31GB

G1 gc almost tuned for write throughput (90%):

Separate physical disk drive for commitlog and data

commitlog compression (lz4) + sstable compression (flush lz4 + compaction: zstd)

internode_compression: all




Client side: datastax-oss 4.13 with client protocol encryption, 10 threads/1000 
async insert



And here is the benchmark result for the single-node cluster, which is the only scenario 
I could validate with multiple repeats:



Scenario      Write/sec   Node CPU usage (other resources < 10% utilized)
No_SSL        115K        90%
Client_SSL    112K        90%




So the overhead was 2.5% for client SSL on single node cluster with default SSL 
configs.




Honestly, I'm not very satisfied with the accuracy of my benchmarks, because I could 
not use all the CPU resources on a multi-node cluster with RF > 1, and the throughput was 
almost the same for both the SSL and non-SSL configurations in those scenarios (I 
asked the community for help on that matter but still no luck). 



Eric, as for making it a blog post: it's not a comprehensive, accurate experiment to rely 
on, as I explained, but the information I provided above is all I have so far. If you 
need more information or have suggestions for improving these experiments, please let 
me know.



Daemeon, the output packets from the client (which uses lz4 compression) are about 400 
bytes, and from the TCP section of netstat -s: about 16M segments were sent, with 1000 
of them retransmitted.


Best Regards



Sent using https://www.zoho.com/mail/






 On Mon, 07 Feb 2022 06:50:16 +0330 daemeon reiydelle  
wrote 



The % numbers seem high for a clean network and a reasonably fast client. The 
5% is really not reasonable. No jumbo frames? No network retries (netstats)?







Daemeon Reiydelle

email: mailto:daeme...@gmail.com

San Francisco 1.415.501.0198/Skype daemeon.c.m.reiydelle



"Why is it so hard to rhyme either Life or Love?" - Sondheim









On Sun, Feb 6, 2022 at 6:06 PM Dinesh Joshi <mailto:djo...@apache.org> wrote:

I wish there was an easy answer to this question. Like you pointed out it is 
hardware dependent but software stack plays a big part. For instance, the JVM 
you're running makes a difference too. Cassandra comes with netty and IIRC we 
include tcnative which accelerates TLS. You could also slip Amazon's Corretto 
Crypto Provider into your runtime. I am not suggesting using everything all at 
once but a combination of libraries, runtimes, JVM, OS, cipher suites can make 
a big difference. Therefore it is best to try it out on your stack.



Typically modern hardware has accelerators for common encryption algorithms. If 
the software stack enables you to optimally take advantage of the hardware then 
you could see very little to no impact on latencies.



Cassandra maintains persistent connections therefore the visible impact is on 
connection establishment time (TLS handshake is expensive). Encryption will 
make thundering herd problems worse. You should watch out for those two issues.



Dinesh





On Feb 5, 2022, at 3:53 AM, onmstester onmstester <mailto:onmstes...@zoho.com> 
wrote:



Hi, 



Has anyone measured the impact of wire encryption with TLS 
(client_encryption/server_encryption) on cluster latency/throughput? 

It may depend on the hardware or even the data model, but I already did some 
measurements, got about 2% overhead for client encryption and 3-5% for client + 
server encryption, and wanted to validate that with the community.



Best Regards



Sent using https://www.zoho.com/mail/

TLS/SSL overhead

2022-02-05 Thread onmstester onmstester
Hi, 



Has anyone measured the impact of wire encryption with TLS 
(client_encryption/server_encryption) on cluster latency/throughput? 

It may depend on the hardware or even the data model, but I already did some 
measurements, got about 2% overhead for client encryption and 3-5% for client + 
server encryption, and wanted to validate that with the community.



Best Regards



Sent using https://www.zoho.com/mail/

Fwd: Re: Cassandra internal bottleneck

2022-02-05 Thread onmstester onmstester
Thanks,



I've got only one client, with 10 threads and 1K async writes. This single client 
was able to send 110K inserts/second to the single-node cluster, but it's only 
sending 90K inserts/second to the cluster with 2 nodes (client CPU/network usage 
is less than 20%).


Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Erick Ramirez 
To: 
Date: Sat, 05 Feb 2022 13:25:23 +0330
Subject: Re: Cassandra internal bottleneck
 Forwarded message 



How many clients do you have sending write requests? In several cases I've 
worked on, the bottleneck is on the client side.



Try increasing the number of app instances and you might find that the combined 
throughput increases significantly. Cheers!

Cassandra internal bottleneck

2022-02-05 Thread onmstester onmstester
Hi, 



I'm trying to evaluate performance of Apache Cassandra V4.0.1 for write-only 
workloads using on-premise physical servers.

On a single-node cluster, with some optimizations I was able to push the node's CPU above 
90%; throughput is high enough, and CPU is the bottleneck, as I expected.

Then, running the same benchmark on a cluster with two nodes / RF=2 / CL=ALL, the 
throughput decreased by 20% compared with the single-node scenario, but CPU usage 
on both nodes is about 70% (oscillating between 50% and 90% every 5-6 seconds). 
I wonder how I could keep CPU usage above 90% steadily in this scenario and so reach 
the maximum throughput of my hardware (resources other than CPU are used at less than 
10% of their capacity).

From jvisualvm: there are only 90 Native-Transport threads, mostly waiting, and 
31 MutationStage threads, also mostly waiting; the only threads that are 
always running are Messaging-EventLoop (6 threads) and epollEventLoop (40 threads).

Where is the bottleneck of the cluster now? How can I drive it back to 90% CPU and 
maximum write throughput? How can I debug the stage pipeline of Cassandra's SEDA-style 
architecture to find such bottlenecks?
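For reference, the usual starting point for watching the stages from the outside (the pool names below are the ones tpstats reports; nothing here is specific to my setup):

  nodetool tpstats                                        # Active/Pending/Blocked per stage
  watch -n1 'nodetool tpstats | grep -Ei "native|mutation"'
  nodetool netstats                                       # cross-node messaging backlog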



Best Regards



Sent using https://www.zoho.com/mail/

Fwd: Re: Problem on setup Cassandra v4.0.1 cluster

2022-01-31 Thread onmstester onmstester
Once again it was related to hostname configuration (I remember having had problems 
with this multiple times before, even with different applications); this time the 
root cause was a typo in one of the multiple config files for the hostname (a different 
name in /etc/hostname than in /etc/hosts)! I fixed that and now there is no 
problem.



I wonder how Cassandra-3.11 worked?!



P.S.: The default DC name in version 4 was changed to datacenter1 (from dc1), and it 
seems to cause a bit of trouble with previous configs (the default in the rack-dc 
conf is still dc1).


Thank you


Best Regards
Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Erick Ramirez 
To: 
Date: Mon, 31 Jan 2022 15:06:21 +0330
Subject: Re: Problem on setup Cassandra v4.0.1 cluster
 Forwarded message 



TP stats indicate pending gossip. Check that the times are synchronised on both 
nodes (use NTP) since it can prevent gossip from working.



I'd also suggest looking at the logs on both nodes to see what other WARN and 
ERROR messages are being reported. Cheers!

Problem on setup Cassandra v4.0.1 cluster

2022-01-31 Thread onmstester onmstester
Hi, 

I'm trying to setup a Cluster of  apache Cassandra version 4.0.1 with 2 nodes:

1. on node1 (192.168.1.1), extracted tar.gz and config these on yml:

 - seeds: "192.168.1.1"

listen_address: 192.168.1.1

rpc_address: 192.168.1.1

2. started node1 and a few seconds later it is UN

3.on node2 (192.168.1.2), extracted tar.gz and config these on yml:

 - seeds: "192.168.1.1"

listen_address: 192.168.1.2

rpc_address: 192.168.1.2

4. started node2, but a few seconds later errors were reported in system.log:

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,499 MessagingMetrics.java:206 - 
ECHO_REQ messages were dropped in last 5000 ms: 0 internal and 5 cross node. 
Mean internal dropped latency: 0 ms and Mean cross-node dropped latency: 53142 
ms

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,499 StatusLogger.java:65 - Pool 
Name   Active   Pending  Completed   Blocked  All Time 
Blocked

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,503 StatusLogger.java:69 - 
CompactionExecutor   0 0  0 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,503 StatusLogger.java:69 - 
MemtableReclaimMemory    0 0  4 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,503 StatusLogger.java:69 - 
GossipStage  0 0 24 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,503 StatusLogger.java:69 - 
SecondaryIndexManagement 0 0  0 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,503 StatusLogger.java:69 - 
HintsDispatcher  0 0  0 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,504 StatusLogger.java:69 - 
MemtableFlushWriter  0 0  4 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,504 StatusLogger.java:69 - 
PerDiskMemtableFlushWriter_0 0 0  4 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,504 StatusLogger.java:69 - 
MemtablePostFlush    0 0  8 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,504 StatusLogger.java:69 - Sampler 
 0 0  0 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,504 StatusLogger.java:69 - 
ValidationExecutor   0 0  0 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,504 StatusLogger.java:69 - 
ViewBuildExecutor    0 0  0 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,505 StatusLogger.java:69 - 
CacheCleanupExecutor 0 0  0 0   
  0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,505 StatusLogger.java:79 - 
CompactionManager 0 0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,505 StatusLogger.java:91 - 
MessagingService    n/a   0/0

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,506 StatusLogger.java:101 - Cache 
Type Size Capacity   KeysToSave

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,506 StatusLogger.java:103 - 
KeyCache    240    104857600
  all

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,506 StatusLogger.java:109 - 
RowCache  0    0
  all

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,506 StatusLogger.java:116 - Table  
 Memtable ops,data

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,506 StatusLogger.java:119 - 
system_schema.columns   218,33711

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,506 StatusLogger.java:119 - 
system_schema.types  2,16

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.indexes    2,16

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.keyspaces 4,242

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.dropped_columns   3,123

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.aggregates 2,16

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.triggers   2,16

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.tables 32,22085

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.views  2,16

INFO  [ScheduledTasks:1] 2022-01-31 13:46:37,507 StatusLogger.java:119 - 
system_schema.functions  2,16


Re: gc throughput

2021-11-17 Thread onmstester onmstester
Thank You

I'm trying to achieve the highest possible (write) throughput with Cassandra and 
care less about latency. Recommendations from the community suggest that it is better to 
use G1GC with a 16GB heap, but when I already get 92% GC throughput with CMS, should 
I consider changing it?



Sent using https://www.zoho.com/mail/






 On Tue, 16 Nov 2021 16:52:29 +0330 Bowen Song  wrote 



Do you have any performance issues, such as long STW GC pauses or high p99.9 latency? 
If not, then you shouldn't tune the GC for the sake of it. However, if you do have 
performance issues related to GC, then regardless of what the GC metric you are looking 
at says, you will need to address the issue, and that will probably involve some GC tuning.

On 15/11/2021 06:00, onmstester onmstester wrote:

Hi, 

We are using Apache Cassandra 3.11.2 with its default GC configuration (CMS etc.) on a 
16GB heap. I inspected the GC logs using GCViewer and it reported 92% throughput. Does 
that mean no further GC tuning is necessary and that everything is OK with Cassandra's GC?





Sent using https://www.zoho.com/mail/

Re: Separating storage and processing

2021-11-15 Thread onmstester onmstester
Thank You


Sent using https://www.zoho.com/mail/





 On Tue, 16 Nov 2021 10:00:19 +0330   wrote 


> I can, but I thought 5TB per node already violated best practices (1-2 TB per node); 
> wouldn't it be a bad idea to 2x or 3x that?


The main downside of larger disks is that it takes longer to replace a host 
that goes down, since there’s less network capacity to move data from surviving 
instances to the new, replacement instances. The longer it takes to replace a 
host, the longer the time window when further failure may cause unavailability 
(for example: if you’re running in a 3-instance cluster, one node goes down and 
requires replacement, any additional nodes going down will cause downtime for 
reads that require a quorum).



These are some of the main factors to consider here. You can always bump the 
disk capacity for one instance, measure replacement times, then decide whether 
to increase disk capacity across the cluster.

Re: Separating storage and processing

2021-11-15 Thread onmstester onmstester
I can, but I thought 5TB per node already violated best practices (1-2 TB per node); 
wouldn't it be a bad idea to 2x or 3x that?


Sent using https://www.zoho.com/mail/





 On Mon, 15 Nov 2021 20:55:53 +0330   wrote 


It sounds like you can downsize your cluster but increase your drive capacity. 
Depending on how your cluster is deployed, it’s very possible that disks larger 
than 5TB per node are available. Could you reduce the number of nodes and 
increase your disk sizes? 
 
— 
Abe

gc throughput

2021-11-14 Thread onmstester onmstester
Hi, 

We are using Apache Cassandra 3.11.2 with its default GC configuration (CMS etc.) on a 
16GB heap. I inspected the GC logs using GCViewer and it reported 92% throughput. Does 
that mean no further GC tuning is necessary and that everything is OK with Cassandra's GC?
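For context, GC throughput here means the fraction of wall-clock time not spent in stop-the-world pauses. A crude cross-check of GCViewer's number, assuming -XX:+PrintGCApplicationStoppedTime is enabled in jvm.options (it usually is in the default GC logging block; the log path is just an example):

  grep "application threads were stopped" /var/log/cassandra/gc.log* \
    | awk '{ for (i = 1; i <= NF; i++) if ($i == "stopped:") sum += $(i+1) }
           END { printf "total STW pause: %.1f s\n", sum }'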





Sent using https://www.zoho.com/mail/

Separating storage and processing

2021-11-14 Thread onmstester onmstester
Hi, 

In our Cassandra cluster, because of big rows in the input data/data model with a TTL of 
several months, we ended up using almost 80% of storage (5TB per node) but less than 20% 
of CPU, almost all of which goes to writing rows to memtables and compacting sstables, 
so a lot of CPU capacity is wasted.

I wonder if there is anything we can do to solve this problem with Cassandra, or should 
we migrate from Cassandra to something that separates storage and processing (currently 
I'm not aware of anything as stable as Cassandra)?



Sent using https://www.zoho.com/mail/

Re: New Servers - Cassandra 4

2021-08-11 Thread onmstester onmstester
Hi,

What about this type of blade chassis, which gives you about 12 (commodity) servers 
in 3U:

https://www.supermicro.com/en/products/microcloud



Sent using https://www.zoho.com/mail/





 On Tue, 03 Aug 2021 02:01:13 +0430 Joe Obernberger 
 wrote 


Thank you Max.  That is a solid choice.  You can even configure each blade with two 
15TBytes SSDs (may not be wise), but that would yield ~430TBytes of SSD across 14 nodes 
in 4u space for around $150k.

-Joe

On 8/2/2021 4:29 PM, Max C. wrote:

Have you considered a blade chassis?  Then you can get most of the redundancy of having 
lots of small nodes in few(er) rack units.

SuperMicro has a chassis that can accommodate 14 servers in 4U:

https://www.supermicro.com/en/products/superblade/enclosure#4U

- Max

On Aug 2, 2021, at 12:05 pm, Joe Obernberger wrote:

Thank you Jeff.  Consider that if rack space is at a premium, what would make the most 
sense?

-Joe

On 8/2/2021 2:46 PM, Jeff Jirsa wrote:

IF you bought a server with that topology, you would definitely want to run lots of 
instances, perhaps 24, to effectively utilize that disk space.

You'd also need 24 IPs, and you'd need a NIC that could send/receive 24x the normal 
bandwidth. And the cost of rebuilding such a node would be 24x higher than normal (so 
consider how many of those you'd have in a cluster, and how often they'd fail).

On Mon, Aug 2, 2021 at 11:06 AM Joe Obernberger wrote:

We have a large amount of data to be stored in Cassandra, and if we were to purchase 
new hardware in limited space, what would make the most sense? Dell has machines with 
24, 8TByte drives in a 2u configuration. Given Cassandra's limitations (?) to large 
nodes, would it make sense to run 24 copies of Cassandra on that one node (one per drive)?

Thank you!

-Joe
 




Re: Question about the num_tokens

2021-04-28 Thread onmstester onmstester
Some posts/papers discuss this in more detail, for example this one from 
thelastpickle:

https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html

Which says:

Using statistical computation, the point where all clusters of any size always 
had a good token range balance was when 256 vnodes were used. Hence, the 
num_tokens default value of 256 was the recommended by the community to prevent 
hot spots in a cluster. The problem here is that the performance for operations 
requiring token-range scans (e.g. repairs, Spark operations) will tank big 
time. It can also cause problems with bootstrapping due to large numbers of 
SSTables generated. Furthermore, as Joseph Lynch and Josh Snyder pointed out in 
a post they wrote 
(http://mail-archives.apache.org/mod_mbox/cassandra-dev/201804.mbox/%3CCALShVHcz5PixXFO_4bZZZNnKcrpph-=5QmCyb0M=w-mhdyl...@mail.gmail.com%3E), 
the higher the value of num_tokens in large clusters, the higher the risk of data 
unavailability.





Sent using https://www.zoho.com/mail/







 On Wed, 28 Apr 2021 10:43:35 +0430 Jai Bheemsen Rao Dhanwada 
 wrote 


Thank you,

Is there a specific reason why Cassandra4.0 recommends to use 16 tokens?



On Tue, Apr 27, 2021 at 11:11 PM Jeff Jirsa  wrote:



On Apr 27, 2021, at 10:47 PM, Jai Bheemsen Rao Dhanwada 
 wrote:



Hello,


I am currently using num_tokens: 256 in my cluster with the version 3.11.6 and 
when I looked at the Cassandra4.0 I see the num_tokens set to 16. 



Is there a specific reason for changing the default value from 256 to 16? 

What is the best value to use ?






Probably something like 16 on new clusters. If you have an existing cluster, 
it’s likely not worth the hassle to change it unless it’s actively causing you 
pain 


If 16 is recommended, is there a way to change the num_tokens to 16 from 256 on 
the live production cluster?






Not easily, no. You have to add a new data center or similar. Lots of effort. 






I tried to directly update and restart Cassandra but the process won't startup 
with the below error



org.apache.cassandra.exceptions.ConfigurationException: Cannot change the 
number of tokens from 256 to 16



Any suggestions? 





Change the yaml back to 256 so it starts

Re: What Happened To Alternate Storage And Rocksandra?

2021-03-12 Thread onmstester onmstester
Besides the enhancements at the storage layer, I think there are a couple of good ideas 
in RocksDB that could be used in Cassandra, like disabling sorting at memtable-insert 
time (writing the data fast, like the commitlog does) and only sorting the data when 
flushing/creating the sst files.

Sent using https://www.zoho.com/mail/






 On Fri, 12 Mar 2021 23:47:05 +0330 Elliott Sims  
wrote 


I'm not too familiar with the details on what's happened more recently, but I 
do remember that while Rocksandra was very favorably compared to Cassandra 2.x, 
the improvements looked fairly similar in nature and magnitude to what 
Cassandra got from the move to the 3.x sstable format and increased use of 
off-heap memory.  That might have damped a lot of the enthusiasm for further 
development.


On Fri, Mar 12, 2021 at 10:50 AM Gareth Collins 
 wrote:

Hi,

I remember a couple of years ago there was some noise about Rocksandra 
(Cassandra using rocksdb for storage) and opening up Cassandra to alternate 
storage mechanisms.



I haven't seen anything about it for a while now though. The last commit to 
Rocksandra on github was in Nov 2019. The associated JIRA items 
(CASSANDRA-13474 and CASSANDRA-13476) haven't had any activity since 2019 
either.



I was wondering whether anyone knew anything about it. Was it decided that this 
wasn't a good idea after all (the alleged performance differences weren't worth 
it...or were exaggerated)? Or is it just that it still may be a good idea, but 
there are no resources available to make this happen (e.g. perhaps the original 
sponsor moved onto other things)?



I ask because I was looking at RocksDB/Kafka Streams for another project (which 
may replace some functionality which currently uses Cassandra)...and was 
wondering if there could be some important info about RocksDB I may be missing.



thanks in advance,

Gareth Collins

Fwd: Re: using zstd cause high memtable switch count

2021-02-28 Thread onmstester onmstester
No, i didn't backport that one.



Thank you


Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Kane Wilson 
To: 
Date: Mon, 01 Mar 2021 03:18:33 +0330
Subject: Re: using zstd cause high memtable switch count
 Forwarded message 


Did you also backport 
https://github.com/apache/cassandra/commit/9c1bbf3ac913f9bdf7a0e0922106804af42d2c1e
 to still use LZ4 for flushing? I would be curious if this is a side effect of 
using zstd for flushing.
https://raft.so - Cassandra consulting, support, and managed services







On Sun, Feb 28, 2021 at 9:22 PM onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:

Hi,



I'm using 3.11.2. I just added the patch for zstd and changed the table compression 
from the default (LZ4) to zstd with level 1 and a 64kb chunk size. Everything is fine 
(disk usage decreased by 40% and CPU usage is almost the same as before); only 
the memtable switch count changed dramatically: with lz4 it was less than 
100 for a week, but with zstd it was more than 1000. I don't understand how that is 
related.



P.S.: Thank you guys for bringing zstd to Cassandra; it had a huge impact on my 
use-case by cutting almost 40% of costs. I wish I could have used it sooner 
(although some sort of patch was already available for this feature 4 years ago).

I just found out that the HBase guys have had zstd since 2017, and IMHO it would be good 
for the Cassandra community to change its release policy to deliver such features 
faster. 


Best Regards
Sent using https://www.zoho.com/mail/

using zstd cause high memtable switch count

2021-02-28 Thread onmstester onmstester
Hi,



I'm using 3.11.2. I just added the patch for zstd and changed the table compression 
from the default (LZ4) to zstd with level 1 and a 64kb chunk size. Everything is fine 
(disk usage decreased by 40% and CPU usage is almost the same as before); only 
the memtable switch count changed dramatically: with lz4 it was less than 
100 for a week, but with zstd it was more than 1000. I don't understand how that is 
related.
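For reference, the counter in question can be read directly (keyspace/table names are placeholders):

  nodetool tablestats my_ks.my_table | grep -i "memtable switch count"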



P.S.: Thank you guys for bringing zstd to Cassandra; it had a huge impact on my 
use-case by cutting almost 40% of costs. I wish I could have used it sooner 
(although some sort of patch was already available for this feature 4 years ago).

I just found out that the HBase guys have had zstd since 2017, and IMHO it would be good 
for the Cassandra community to change its release policy to deliver such features 
faster. 


Best Regards
Sent using https://www.zoho.com/mail/

number of racks in a deployment with VMs

2021-02-14 Thread onmstester onmstester
Hi,

In an article by thelastpickle [1], I noticed:



The key here is to configure the cluster so that for a given datacenter the 
number of racks is the same as the replication factor.



When using virtual machines as Cassandra nodes, we have to set up the cluster so that 
the number of racks equals the number of physical servers; that way, losing one physical 
server loses only one copy of any piece of data, right? The number of racks could then 
be much greater than the RF. Would this cause any harm to cluster 
health/balance/availability?

[1]: 
https://thelastpickle.com/blog/2021/01/29/impacts-of-changing-the-number-of-vnodes.html
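For concreteness, the mapping I have in mind would look like this with GossipingPropertyFileSnitch, giving every VM on the same physical server the same rack name (file path, DC and host names are just examples):

  # on each VM hosted by physical server 3:
  cat > /etc/cassandra/cassandra-rackdc.properties <<'EOF'
  dc=dc1
  rack=phys-host-3
  EOF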



Sent using https://www.zoho.com/mail/

Fwd: Re: local read from coordinator

2020-11-14 Thread onmstester onmstester
What if I use sstabledump? 

Because I'm going to read all the data from a table that has no more 
inserts/compactions (a fixed list of sstables).

For each sstable:

1. Using sstabledump -e, fetch all partition keys stored in the sstable (an ordered 
list).

2. Using sstabledump -k partitionKey, fetch all rows for that partition key from 
the sstable, and with nodetool getsstables, check whether any other part of the data 
lives in other sstables.

This way maybe I could get 100% disk read performance (the fastest possible sequential 
read of all data in a table).

I will put the outcome of this experiment here
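For reference, the per-sstable commands behind steps 1 and 2, with placeholder paths and keys (double-check the option placement against your version's help output):

  sstabledump /path/to/my_table/mc-42-big-Data.db -e                    # partition keys only, as JSON
  sstabledump /path/to/my_table/mc-42-big-Data.db -k somePartitionKey   # full rows for that partition
  nodetool getsstables my_ks my_table somePartitionKey                   # other sstables holding that key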






Sent using https://www.zoho.com/mail/







 Forwarded message 
From: onmstester onmstester 
To: "user"
Date: Sat, 14 Nov 2020 08:24:14 +0330
Subject: Re: local read from coordinator
 Forwarded message 



Thank you Jeff,

I disabled dynamic_snitch (after reading some docs about how it doesn't work well enough 
in production) and I'm using GossipingPropertyFileSnitch. Does that make any difference 
to how the replica is chosen?



Sent using https://www.zoho.com/mail/








 On Wed, 11 Nov 2020 17:53:42 +0330 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 





This isn’t necessarily true and cassandra has no coordinator-only consistency 
level to force this behavior 



(The snitch is going to pick the best option for local_one reads and any 
compactions or latency deviations from load will make it likely that another 
replica is chosen in practice)



On Nov 11, 2020, at 3:46 AM, Alex Ott <mailto:alex...@gmail.com> wrote:





if you force routing key, then the replica that owns the data will be selected 
as coordinator



On Wed, Nov 11, 2020 at 12:35 PM onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:



Thanx,



But I'm OK with the coordinator part; actually I was looking for a kind of read CL that 
forces reading from the coordinator only, with no connections to other nodes!



Sent using https://www.zoho.com/mail/








 Forwarded message 
From: Alex Ott <mailto:alex...@gmail.com>
To: "user"<mailto:user@cassandra.apache.org>
Date: Wed, 11 Nov 2020 11:28:56 +0330
Subject: Re: local read from coordinator
 Forwarded message 



token-aware policy doesn't work for token range queries (at least in the Java 
driver 3.x).  You need to force the driver to do the reading using a specific 
token as a routing key.  Here is Java implementation of the token range 
scanning algorithm that Spark uses: 
https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java



I'm not aware if Python driver is able to set routing key explicitly, but 
whitelist policy should help








On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez 
<mailto:erick.rami...@datastax.com> wrote:

Yes, use a token-aware policy so the driver will pick a coordinator where the 
token (partition) exists. Cheers!








-- 

With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)


















-- 

With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

Re: local read from coordinator

2020-11-13 Thread onmstester onmstester
Thank you Jeff,

I disabled dynamic_snitch (after reading some docs about how it doesn't work well enough 
in production) and I'm using GossipingPropertyFileSnitch. Does that make any difference 
to how the replica is chosen?


Sent using https://www.zoho.com/mail/






 On Wed, 11 Nov 2020 17:53:42 +0330 Jeff Jirsa  wrote 



This isn’t necessarily true and cassandra has no coordinator-only consistency 
level to force this behavior 

(The snitch is going to pick the best option for local_one reads and any 
compactions or latency deviations from load will make it likely that another 
replica is chosen in practice)

On Nov 11, 2020, at 3:46 AM, Alex Ott <mailto:alex...@gmail.com> wrote:



if you force routing key, then the replica that owns the data will be selected 
as coordinator


On Wed, Nov 11, 2020 at 12:35 PM onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:

Thanx,



But I'm OK with the coordinator part; actually I was looking for a kind of read CL that 
forces reading from the coordinator only, with no connections to other nodes!


Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Alex Ott <mailto:alex...@gmail.com>
To: "user"<mailto:user@cassandra.apache.org>
Date: Wed, 11 Nov 2020 11:28:56 +0330
Subject: Re: local read from coordinator
 Forwarded message 


token-aware policy doesn't work for token range queries (at least in the Java 
driver 3.x).  You need to force the driver to do the reading using a specific 
token as a routing key.  Here is Java implementation of the token range 
scanning algorithm that Spark uses: 
https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java



I'm not aware if Python driver is able to set routing key explicitly, but 
whitelist policy should help







On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez 
<mailto:erick.rami...@datastax.com> wrote:

Yes, use a token-aware policy so the driver will pick a coordinator where the 
token (partition) exists. Cheers!






-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)















-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

Fwd: Re: local read from coordinator

2020-11-11 Thread onmstester onmstester
Thanx,



But I'm OK with the coordinator part; actually I was looking for a kind of read CL that 
forces reading from the coordinator only, with no connections to other nodes!

Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Alex Ott 
To: "user"
Date: Wed, 11 Nov 2020 11:28:56 +0330
Subject: Re: local read from coordinator
 Forwarded message 


token-aware policy doesn't work for token range queries (at least in the Java 
driver 3.x).  You need to force the driver to do the reading using a specific 
token as a routing key.  Here is Java implementation of the token range 
scanning algorithm that Spark uses: 
https://github.com/alexott/cassandra-dse-playground/blob/master/driver-1.x/src/main/java/com/datastax/alexott/demos/TokenRangesScan.java



I'm not aware if Python driver is able to set routing key explicitly, but 
whitelist policy should help







On Wed, Nov 11, 2020 at 7:03 AM Erick Ramirez 
 wrote:

Yes, use a token-aware policy so the driver will pick a coordinator where the 
token (partition) exists. Cheers!






-- 
With best wishes,                    Alex Ott
http://alexott.net/
Twitter: alexott_en (English), alexott (Russian)

local read from coordinator

2020-11-10 Thread onmstester onmstester
Hi,

I'm going to read all the data in the cluster as fast as possible. I'm aware that Spark 
can do such things out of the box, but I just wanted to do it at a low level to see how 
fast it could be. So:
1. Retrieve the partition keys on each node, using the token ranges from nodetool ring 
and fetching the distinct partitions for each range.
2. Run the query for each partition on its main replica node using Python (a parallel 
job on all nodes of the cluster). I used a load-balancing policy with only the local IP 
as the contact point, but I will try the whitelist policy too (with a whitelist 
load-balancing policy, reads are restricted to a single/local coordinator, with the 
Python script on the same host as the coordinator).
This mechanism turned out to be fast, but not as fast as a sequential read of the disk 
could be (the query could theoretically be 100 times faster!).
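For reference, the per-range query in step 2 looks like this in cqlsh form (keyspace, table, column names and token values are placeholders; the range bounds come from nodetool ring / describering):

  cqlsh <replica_ip> -e "SELECT pk, payload FROM my_ks.my_table
                         WHERE token(pk) > -9223372036854775808
                           AND token(pk) <= -9100000000000000000;"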




I'm using RF=3 in a single-DC cluster with the default CL, which is LOCAL_ONE. I suspect 
the coordinator may also be contacting other replicas, but how can I debug that?

Is there any workaround to force the coordinator to only read data from itself, so that:

- if there are other replicas (besides the coordinator) for the partition key, only the 
coordinator's data is read and returned, and it does not even check the other replicas 
for the data;

- if the coordinator is not a replica for the partition key, it simply throws an 
exception or returns an empty result?


Is there any mechanism to accomplish this kind of local read?

Best Regards
Sent using https://www.zoho.com/mail/

OOM on ccm with large cluster on a single node

2020-10-27 Thread onmstester onmstester
Hi, 



I'm using ccm to create a cluster of 80 nodes on a physical server with 10 cores and 
64GB of RAM, but the 43rd node always fails to start, with the error:



apache cassandra 3.11.2

cassandra xmx600M

30GB of memory is still free (according to free -m)

ulimit -u: 255K
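For what it's worth, that particular OutOfMemoryError is about creating native threads rather than heap, so the limits worth checking alongside ulimit -u are (commands assume a standard Linux box):

  ulimit -u                                                  # per-user process/thread limit
  sysctl kernel.threads-max kernel.pid_max vm.max_map_count  # kernel-wide caps
  cat /proc/$(pgrep -f CassandraDaemon | head -1)/limits     # limits actually applied to one node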

Sent using https://www.zoho.com/mail/

reducing RF wen using token allocation algorithm

2020-10-26 Thread onmstester onmstester
Hi,

I've set up cluster with:

3.11.2

30 nodes

RF=3,single dc, NetworkStrategy


Now I'm going to reduce the RF to 2, but I set up the cluster with vnodes=16 and the 
token allocation algorithm (allocate_tokens_for_keyspace) for the main keyspace (the one 
whose RF I'm reducing). Is the procedure still 1. alter the keyspace and 2. run nodetool 
cleanup on all nodes?
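For reference, those two steps in command form (keyspace and DC names are placeholders; cleanup is then run node by node):

  cqlsh -e "ALTER KEYSPACE my_ks
            WITH replication = {'class': 'NetworkTopologyStrategy', 'dc1': 2};"
  nodetool cleanup my_ks        # repeat on every node, one at a time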









Sent using https://www.zoho.com/mail/

Re: dropped mutations cross node

2020-10-05 Thread onmstester onmstester
Thanks,

I made a lot of config changes to fix the problem but nothing worked (the last one was 
disabling hints), and after a few days the problem went away on its own!!

The source of the droppedCrossNode messages changed every half an hour, and it was not 
always the new nodes.

No difference between new nodes and old ones in configuration and node spec

Sent using https://www.zoho.com/mail/




 On Mon, 05 Oct 2020 09:14:17 +0330 Erick Ramirez 
 wrote 


Sorry for the late reply. Do you still need assistance with this issue?



If the source of the dropped mutations and high latency are the newer nodes, 
that indicates to me that you have an issue with the commitlog disks. Are the 
newer nodes identical in hardware configuration to the pre-existing nodes? Any 
differences in configuration you could point out? Cheers!

dropped mutations cross node

2020-09-21 Thread onmstester onmstester
Hi, 

I've extended a cluster by 10%, and since then, every hour, on some of the nodes 
(which change randomly each time), "dropped mutations cross node" appears in the logs 
(each time 1 or 2 drops, and sometimes some thousands, with cross-node latency from 
3000 ms up to 90 seconds!), and the insert rate has decreased by about 50%:

- token ownership looks OK (the stdev of the ownership percentages even decreased with 
the cluster extension)

- CPU usage on the nodes is less than 30 percent and well balanced

- disk usage is less than 10% according to iostat, and there are no pending compactions 
on the nodes

- there is no other log entry besides the dropped reports (although there are a few GCs 
of about 200-300 ms every 5 minutes)

- no sign of a memory problem looking at jvisualVM

- honestly I do not monitor the network equipment (switches), but the network has not 
changed since the cluster was extended, and there is no increase in the packet discard 
counters on the node side

So to emphasize: there are mutation drops whose root cause I cannot detect. Is there any 
workaround or monitoring metric that I have missed here?





Cluster Info:
Cassandra 3.11.2

RF 3

30 Nodes


Sent using https://www.zoho.com/mail/

Re: Node is UNREACHABLE after decommission

2020-09-19 Thread onmstester onmstester
Another workaround that I used for the UNREACHABLE nodes problem is to restart the whole 
cluster, which fixes it, but I don't know whether it causes any other problems.

Sent using https://www.zoho.com/mail/




 On Fri, 18 Sep 2020 01:19:35 +0430 Paulo Motta  
wrote 


Oh, if you're adding the same hosts to another cluster then the old cluster 
might try to communicate to the decommissioned nodes if you do that before the 
3 day grace period. The cluster name not matching is a good protection, 
otherwise the two clusters will connect to each other and mayhem will ensue. 
You definitely don't want this to happen!



Unfortunately this delay of 3 days is hard-coded and non-configurable before 
4.0 (see https://issues.apache.org/jira/browse/CASSANDRA-15596).



As long as all the old cluster nodes are UP and don't see the decommissioned 
node on "nodetool status", you can safely assassinate decommissioned nodes to 
prevent this.





The requirement that all nodes in the cluster are UP before safely 
assassinating a node is to prevent a down node from trying to connect to the 
decommissioned node after it recovers.





Em qui., 17 de set. de 2020 às 17:25, Krish Donald 
 escreveu:

Thanks Paulo,

We have to decommission multiple nodes from the  cluster and move those nodes 
to other clusters. 

So if we have to wait for 3 days for every node then it is going to take a lot 
of time.

If i am trying to add the decommissioned node to the other cluster it is giving 
me an error that cluster_name is not matching however cluster name is correct 
as per new cluster.

So until I issue assassinate, I am not able to move forward.  



On Thu, Sep 17, 2020 at 1:13 PM Paulo Motta  
wrote:

After decommissioning the node remains in gossip for a period of 3 days (if I 
recall correctly) and it will show up on describecluster during that period, so 
this is expected behavior. This allows other nodes that eventually were down 
when the node decommissioned to learn that this node left the cluster.



What assassinate does is remove the node from gossip, so that's why it no 
longer shows up on describecluster, but this shouldn't be necessary. You should 
check that the node successfully decommissioned if it doesn't show up on 
"nodetool status".



Em qui., 17 de set. de 2020 às 14:26, Krish Donald 
 escreveu:

We are on 3.11.5 opensource cassandra


On Thu, Sep 17, 2020 at 10:25 AM Krish Donald  
wrote:

Hi,

We decommissioned a node from the cluster.

On decommissioned node it said in system.log that node has been decommissioned .

But after couple of minutes only , on rest of the nodes the node is showing 
UNREACHABLE when we issue nodetool describecluster .



nodetool status is not showing the node however nodetool describecluster is 
showing UNREACHABLE.



I tried nodetool assassinate and now node is not showing in nodetool 
describecluster , however that seems to be the last option.



Ideally it should leave the cluster immediately after decommission.  

Once decommission has completed according to the log, is there any issue with issuing 
nodetool assassinate?



Thanks

Re: data modeling qu: use a Map datatype, or just simple rows... ?

2020-09-19 Thread onmstester onmstester
I used the Cassandra Set type (no experience with map), and one thing is for sure: with 
Cassandra collections you are limited to a few thousand entries per row (fewer than 10K 
for good performance).


Sent using https://www.zoho.com/mail/




 On Fri, 18 Sep 2020 20:33:21 +0430 Attila Wind  
wrote 


Hey guys,

I'm curious about your experiences regarding a data modeling question we are facing. 
At the moment we see 2 major different approaches in terms of how to build the tables. 
But I've been googling around for days with no luck finding any useful material 
explaining how a Map (as a collection datatype) works on the storage engine, and what 
could surprise us later. So I decided to ask this question... (If someone has some nice 
pointers here, that is also much appreciated!)
So, to describe the problem in a simplified form:

- Imagine you have users (everyone is identified with a UUID),
- and we want to answer a simple question: "have we seen this guy before?"
- we "just" want to be able to answer this question for a limited time - let's
  say for 3 months,
- but... there are lots and lots of users we run into... many millions each
  day...
- and ~15-20% of them are returning users only - so many of them we might see
  just once.


We are thinking about something like a big, big Map, in the form of
    userId => lastSeenTimestamp

Obviously, if we had something like that, then answering the above question is
simply:
    if (map.get(userId) != null)  => TRUE - we have seen the guy before

Regarding the 2 major modelling approaches I mentioned above:

Approach 1
Just simply use a table, something like this:

CREATE TABLE IF NOT EXISTS users (
    user_id      varchar,
    last_seen    int,    -- a UNIX timestamp is enough, that's why int
    PRIMARY KEY (user_id)
) WITH default_time_to_live = <3 months of seconds>;
Approach 2
To avoid producing that many rows, "cluster" the guys a bit together (into 1
row): introduce a hashing function over the userId, producing a value between
[0; 1], and go with a table like:

CREATE TABLE IF NOT EXISTS users (
    user_id_hash    int,
    users_seen      map<text, int>,    -- this is a userId => last timestamp map
    PRIMARY KEY (user_id_hash)
) WITH default_time_to_live = <3 months of seconds>;    -- yes, it's clearly not
                                                        -- a good enough way ...
 
In theory:

- on the WRITE path, both representations give us a way to do the write without
  the need of a read,
- even the READ path is pretty efficient in both cases,
- Approach 2 is definitely worse when we come to the cleanup - "remove info if
  older than 3 months",
- Approach 2 might affect the balance of the cluster more - that's clear
  (however not that much, due to the "law of large numbers" and enough random
  factors).


And what we are struggling with is: what do you think - which approach would be
better over time? That is, which will slow down the cluster less, considering
compaction etc.?

As far as we can see, the real question is: which hurts more?

- many more rows, but very small rows (regarding data size), or
- many fewer rows, but much bigger rows (regarding data size)?

Any thoughts, comments, pointers to some related case studies,
  articles, etc is highly appreciated!! :-)

thanks!

-- 
 Attila Wind
 
  http://www.linkedin.com/in/attilaw
 Mobile: +49 176 43556932
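
For what it's worth, a minimal sketch of Approach 1's write and read paths from 
a shell (the keyspace name ks, the example user id, and the 90-day TTL value 
are assumptions):

    # record "seen now" with a per-row TTL of ~3 months (90 * 86400 = 7776000 s)
    cqlsh -e "INSERT INTO ks.users (user_id, last_seen) VALUES ('some-user-uuid', 1600000000) USING TTL 7776000;"

    # answer "have we seen this guy before?" - an empty result means no
    cqlsh -e "SELECT last_seen FROM ks.users WHERE user_id = 'some-user-uuid';"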

Re: Re: streaming stuck on joining a node with TBs of data

2020-08-05 Thread onmstester onmstester
OK. Thanks 




I'm using STCS.



Anyway, IMHO, this is one of the main bottlenecks for using big/dense nodes in 
Cassandra (which would reduce cluster size and data center costs), and it could 
be almost solved (at least for me) if we could reduce the number of sstables on 
the receiver side (either by sending bigger sstables from the sending side or 
by merging sstables in the memtable on the receiving side).



(Just fixed a wrong word in my previous question.)


 On Wed, 05 Aug 2020 10:02:51 +0430 onmstester onmstester 
<mailto:onmstes...@zoho.com.INVALID> wrote 


OK. Thanks

I'm using STCS.

Anyway, IMHO, this is one of the main bottlenecks for using big/dense node in 
Cassandra (which reduces cluster size and data center costs) and it could be 
almost solved (at least for me), if we could eliminate number of sstables at 
receiver side (either by sending bigger sstables at sending side or by merging 
sstables in memtable at receiving side)


Sent using https://www.zoho.com/mail/




 On Mon, 03 Aug 2020 19:17:33 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 


Memtable really isn't involved here, each data file is copied over as-is and 
turned into a new data file, it doesn't read into the memtable (though it does 
deserialize and re-serialize, which temporarily has it in memory, but isn't in 
the memtable itself).



You can cut down on the number of data files copied in by using fewer vnodes, 
or by changing your compaction parameters (e.g. if you're using LCS, change 
sstable size from 160M to something higher), but there's no magic to join / 
compact those data files on the sending side before sending.
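
As an illustration of the second suggestion, a hedged sketch (keyspace/table 
name and the 320 MB size are assumptions, and this only applies if the table is 
already on LCS):

    cqlsh -e "ALTER TABLE ks.events WITH compaction = {'class': 'LeveledCompactionStrategy', 'sstable_size_in_mb': '320'};"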




On Mon, Aug 3, 2020 at 4:15 AM onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:

IMHO (reading system.log) each streamed-in file from any node would be write 
down as a separate sstable to the disk and won't be wait in memtable until 
enough amount of memtable has been created inside memory, so there would be 
more compactions because of multiple small sstables. Is there any configuration 
in cassandra to force streamed-in to pass memtable-sstable cycle, to have 
bigger sstables at first place?



Sent using https://www.zoho.com/mail/






 Forwarded message ====
From: onmstester onmstester <mailto:onmstes...@zoho.com.INVALID>
To: "user"<mailto:user@cassandra.apache.org>
Date: Sun, 02 Aug 2020 08:35:30 +0430
Subject: Re: streaming stuck on joining a node with TBs of data
 Forwarded message 



Thanks Jeff,



Already used netstats and it only shows that streaming from a single node 
remained and stuck and bunch of dropped messages, next time i will check 
tpstats too.

Currently i stopped the joining/stucked node, make the auto_bootstrap false and 
started the node and its UN now, is this OK too?



What about streaming tables one by one, any idea?



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 21:44:09 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 





Nodetool tpstats and netstats should give you a hint why it’s not joining



If you don’t care about consistency and you just want it joined in its current 
form (which is likely strictly incorrect but I get it), “nodetool disablegossip 
&& nodetool enablegossip” in rapid succession (must be less than 30 seconds in 
between commands) will PROBABLY change it from joining to normal (unclean, 
unsafe, do this at your own risk).





On Jul 31, 2020, at 11:46 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





No Secondary index, No SASI, No materialized view



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 11:02:54 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 



Are there secondary indices involved? 



On Jul 31, 2020, at 10:51 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





Hi,



I'm going to join multiple new nodes to already existed and running cluster. 
Each node should stream in >2TB of data, and it took a few days (with 500Mb 
streaming) to almost get finished. But it stuck on streaming-in from one final 
node, but i can not see any bottleneck on any side (source or destination 
node), the only problem is 400 pending compactions on joining node, which i 
disabled auto_compaction, but no improvement.



1. How can i safely stop streaming/joining the new node and make it UN, then 
run repair on the node?

2. On bootstrap a new node, multiple tables would be streamed-in simultaneously 
and i think that this would increase number of compactions in compare with a 
scenario that "the joining node first stream-in one table then switch to 
another one and etc". Am i right and this would decrease compactions? If so, is 
there a config or hack in cassandra to force that?





Sent using https://www.zoho.com/mail/

Re: Re: streaming stuck on joining a node with TBs of data

2020-08-04 Thread onmstester onmstester
OK. Thanks

I'm using STCS.
Anyway, IMHO, this is one of the main bottlenecks for using big/dense node in 
Cassandra (which reduces cluster size and data center costs) and it could be 
almost solved (at least for me), if we could eliminate number of sstables at 
receiver side (either by sending bigger sstables at sending side or by merging 
sstables in memtable at receiving side)

Sent using https://www.zoho.com/mail/




 On Mon, 03 Aug 2020 19:17:33 +0430 Jeff Jirsa  wrote 


Memtable really isn't involved here, each data file is copied over as-is and 
turned into a new data file, it doesn't read into the memtable (though it does 
deserialize and re-serialize, which temporarily has it in memory, but isn't in 
the memtable itself).



You can cut down on the number of data files copied in by using fewer vnodes, 
or by changing your compaction parameters (e.g. if you're using LCS, change 
sstable size from 160M to something higher), but there's no magic to join / 
compact those data files on the sending side before sending.




On Mon, Aug 3, 2020 at 4:15 AM onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:

IMHO (reading system.log) each streamed-in file from any node would be write 
down as a separate sstable to the disk and won't be wait in memtable until 
enough amount of memtable has been created inside memory, so there would be 
more compactions because of multiple small sstables. Is there any configuration 
in cassandra to force streamed-in to pass memtable-sstable cycle, to have 
bigger sstables at first place?



Sent using https://www.zoho.com/mail/






 Forwarded message 
From: onmstester onmstester <mailto:onmstes...@zoho.com.INVALID>
To: "user"<mailto:user@cassandra.apache.org>
Date: Sun, 02 Aug 2020 08:35:30 +0430
Subject: Re: streaming stuck on joining a node with TBs of data
 Forwarded message 



Thanks Jeff,



Already used netstats and it only shows that streaming from a single node 
remained and stuck and bunch of dropped messages, next time i will check 
tpstats too.

Currently i stopped the joining/stucked node, make the auto_bootstrap false and 
started the node and its UN now, is this OK too?



What about streaming tables one by one, any idea?



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 21:44:09 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 





Nodetool tpstats and netstats should give you a hint why it’s not joining



If you don’t care about consistency and you just want it joined in its current 
form (which is likely strictly incorrect but I get it), “nodetool disablegossip 
&& nodetool enablegossip” in rapid succession (must be less than 30 seconds in 
between commands) will PROBABLY change it from joining to normal (unclean, 
unsafe, do this at your own risk).





On Jul 31, 2020, at 11:46 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





No Secondary index, No SASI, No materialized view



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 11:02:54 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 



Are there secondary indices involved? 



On Jul 31, 2020, at 10:51 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





Hi,



I'm going to join multiple new nodes to already existed and running cluster. 
Each node should stream in >2TB of data, and it took a few days (with 500Mb 
streaming) to almost get finished. But it stuck on streaming-in from one final 
node, but i can not see any bottleneck on any side (source or destination 
node), the only problem is 400 pending compactions on joining node, which i 
disabled auto_compaction, but no improvement.



1. How can i safely stop streaming/joining the new node and make it UN, then 
run repair on the node?

2. On bootstrap a new node, multiple tables would be streamed-in simultaneously 
and i think that this would increase number of compactions in compare with a 
scenario that "the joining node first stream-in one table then switch to 
another one and etc". Am i right and this would decrease compactions? If so, is 
there a config or hack in cassandra to force that?





Sent using https://www.zoho.com/mail/

Fwd: Re: streaming stuck on joining a node with TBs of data

2020-08-03 Thread onmstester onmstester
IMHO (reading system.log), each streamed-in file from any node is written down 
as a separate sstable on disk and does not wait in the memtable until a large 
enough memtable has accumulated in memory, so there would be more compactions 
because of multiple small sstables. Is there any configuration in Cassandra to 
force streamed-in data to pass through the memtable-to-sstable cycle, so as to 
have bigger sstables in the first place?



Sent using https://www.zoho.com/mail/






 Forwarded message 
From: onmstester onmstester 
To: "user"
Date: Sun, 02 Aug 2020 08:35:30 +0430
Subject: Re: streaming stuck on joining a node with TBs of data
 Forwarded message 



Thanks Jeff,



Already used netstats and it only shows that streaming from a single node 
remained and stuck and bunch of dropped messages, next time i will check 
tpstats too.

Currently i stopped the joining/stucked node, make the auto_bootstrap false and 
started the node and its UN now, is this OK too?



What about streaming tables one by one, any idea?



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 21:44:09 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 





Nodetool tpstats and netstats should give you a hint why it’s not joining



If you don’t care about consistency and you just want it joined in its current 
form (which is likely strictly incorrect but I get it), “nodetool disablegossip 
&& nodetool enablegossip” in rapid succession (must be less than 30 seconds in 
between commands) will PROBABLY change it from joining to normal (unclean, 
unsafe, do this at your own risk).





On Jul 31, 2020, at 11:46 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





No Secondary index, No SASI, No materialized view



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 11:02:54 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 



Are there secondary indices involved? 



On Jul 31, 2020, at 10:51 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





Hi,



I'm going to join multiple new nodes to already existed and running cluster. 
Each node should stream in >2TB of data, and it took a few days (with 500Mb 
streaming) to almost get finished. But it stuck on streaming-in from one final 
node, but i can not see any bottleneck on any side (source or destination 
node), the only problem is 400 pending compactions on joining node, which i 
disabled auto_compaction, but no improvement.



1. How can i safely stop streaming/joining the new node and make it UN, then 
run repair on the node?

2. On bootstrap a new node, multiple tables would be streamed-in simultaneously 
and i think that this would increase number of compactions in compare with a 
scenario that "the joining node first stream-in one table then switch to 
another one and etc". Am i right and this would decrease compactions? If so, is 
there a config or hack in cassandra to force that?





Sent using https://www.zoho.com/mail/

Re: streaming stuck on joining a node with TBs of data

2020-08-01 Thread onmstester onmstester
Thanks Jeff,



Already used netstats, and it only shows that streaming from a single node 
remained stuck, plus a bunch of dropped messages; next time I will check 
tpstats too.

Currently I stopped the joining/stuck node, set auto_bootstrap to false and 
started the node, and it is UN now - is this OK too?



What about streaming tables one by one, any idea?

Sent using https://www.zoho.com/mail/




 On Sat, 01 Aug 2020 21:44:09 +0430 Jeff Jirsa  wrote 



Nodetool tpstats and netstats should give you a hint why it’s not joining

If you don’t care about consistency and you just want it joined in its current 
form (which is likely strictly incorrect but I get it), “nodetool disablegossip 
&& nodetool enablegossip” in rapid succession (must be less than 30 seconds in 
between commands) will PROBABLY change it from joining to normal (unclean, 
unsafe, do this at your own risk).



On Jul 31, 2020, at 11:46 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:



No Secondary index, No SASI, No materialized view



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 11:02:54 +0430 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 



Are there secondary indices involved? 



On Jul 31, 2020, at 10:51 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





Hi,



I'm going to join multiple new nodes to already existed and running cluster. 
Each node should stream in >2TB of data, and it took a few days (with 500Mb 
streaming) to almost get finished. But it stuck on streaming-in from one final 
node, but i can not see any bottleneck on any side (source or destination 
node), the only problem is 400 pending compactions on joining node, which i 
disabled auto_compaction, but no improvement.



1. How can i safely stop streaming/joining the new node and make it UN, then 
run repair on the node?

2. On bootstrap a new node, multiple tables would be streamed-in simultaneously 
and i think that this would increase number of compactions in compare with a 
scenario that "the joining node first stream-in one table then switch to 
another one and etc". Am i right and this would decrease compactions? If so, is 
there a config or hack in cassandra to force that?





Sent using https://www.zoho.com/mail/

Re: streaming stuck on joining a node with TBs of data

2020-08-01 Thread onmstester onmstester
No Secondary index, No SASI, No materialized view



Sent using https://www.zoho.com/mail/






 On Sat, 01 Aug 2020 11:02:54 +0430 Jeff Jirsa  wrote 



Are there secondary indices involved? 



On Jul 31, 2020, at 10:51 PM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





Hi,



I'm going to join multiple new nodes to already existed and running cluster. 
Each node should stream in >2TB of data, and it took a few days (with 500Mb 
streaming) to almost get finished. But it stuck on streaming-in from one final 
node, but i can not see any bottleneck on any side (source or destination 
node), the only problem is 400 pending compactions on joining node, which i 
disabled auto_compaction, but no improvement.



1. How can i safely stop streaming/joining the new node and make it UN, then 
run repair on the node?

2. On bootstrap a new node, multiple tables would be streamed-in simultaneously 
and i think that this would increase number of compactions in compare with a 
scenario that "the joining node first stream-in one table then switch to 
another one and etc". Am i right and this would decrease compactions? If so, is 
there a config or hack in cassandra to force that?





Sent using https://www.zoho.com/mail/

streaming stuck on joining a node with TBs of data

2020-07-31 Thread onmstester onmstester
Hi,



I'm going to join multiple new nodes to an already existing, running cluster. 
Each node should stream in >2TB of data, and it took a few days (with 500Mb 
streaming) to almost finish. But it got stuck streaming in from one final node, 
and I cannot see any bottleneck on either side (source or destination node); 
the only problem is 400 pending compactions on the joining node, for which I 
disabled auto-compaction, but with no improvement.



1. How can I safely stop streaming/joining the new node and make it UN, then 
run repair on the node?

2. On bootstrapping a new node, multiple tables are streamed in simultaneously, 
and I think this increases the number of compactions compared with a scenario 
where "the joining node first streams in one table, then switches to another 
one, etc." Am I right that this would decrease compactions? If so, is there a 
config or hack in Cassandra to force that?





Sent using https://www.zoho.com/mail/
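
For anyone monitoring a joining node in this situation, the usual checks look 
roughly like this (run on the joining node):

    nodetool netstats               # per-session streaming progress
    nodetool tpstats                # dropped messages and blocked/pending pools
    nodetool compactionstats        # the pending-compactions backlog mentioned above
    nodetool enableautocompaction   # re-enable compaction once streaming finishes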

Re: Multi DCs vs Single DC performance

2020-07-28 Thread onmstester onmstester
Thanks for your immediate and clear response, Jeff


Sent using https://www.zoho.com/mail/




 On Wed, 29 Jul 2020 08:00:33 +0430 Jeff Jirsa  wrote 


PROBABLY not, unless you've got a very very clever idea of using local 
consistency levels or somehow taking advantage of write forwarding in a way I 
haven't personally figured out yet (maybe if you had a very high replica count 
per DC, then using forwarding and EACH_QUORUM may get fun, but you'd be better 
off dropping the replica count than coming up with stuff like this).


On Tue, Jul 28, 2020 at 8:27 PM onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:

Hi,



Logically, I do not need to use multiple DCs (the cluster is not geographically 
separated), but I wonder whether splitting the cluster into two halves (two 
separate DCs) would decrease the overhead of node acks/communication and result 
in better (write) performance?

Sent using https://www.zoho.com/mail/

Multi DCs vs Single DC performance

2020-07-28 Thread onmstester onmstester
Hi,



Logically, I do not need to use multiple DCs (the cluster is not geographically 
separated), but I wonder whether splitting the cluster into two halves (two 
separate DCs) would decrease the overhead of node acks/communication and result 
in better (write) performance?

Sent using https://www.zoho.com/mail/

Re: design principle to manage roll back

2020-07-14 Thread onmstester onmstester
Hi,



I think that Cassandra alone is not suitable for your use case. You could use a 
mix of a distributed/NoSQL store (to store the single records of whatever makes 
your input "big data") and a relational/single-node database (for the 
transactional, non-big-data part).

Sent using https://www.zoho.com/mail/




 On Tue, 14 Jul 2020 10:47:33 +0430 Manu Chadha  
wrote 



Hi

 

What are the design approaches I can follow to ensure that data is consistent 
from an application perspective (not from an individual table's perspective)? I 
am thinking of issues which arise due to the unavailability of rollback or 
atomic transactions in Cassandra. Is Cassandra not suitable for my project?

 

Cassandra recommends creating a new table for each query. This results in data 
duplication (which doesn't bother me). Take the following scenario: an 
application which allows users to create, share and manage food recipes. Each 
of the functions below adds records to a separate database.

 

for {savedRecipe            <- saveInRecipeRepository(...)
     recipeTagRepository    <- saveRecipeTag(...)
     partitionInfoOfRecipes <- savePartitionOfTheTag(...)
     updatedUserProfile     <- updateInUserProfile(...)
     recipesByUser          <- saveRecipesCreatedByUser(...)
     supportedRecipes       <- updateSupportedRecipesInformation(tag)}

 

If, say, updateInUserProfile fails, then I'll have to manage rollback in the 
application itself, as Cassandra doesn't do it. My concern is that the rollback 
process could itself fail, due to network issues, say.

 

Is there a recommended way or a design principle I can follow to keep data 
consistent?

 

Thanks

Manu

 

Sent from https://go.microsoft.com/fwlink/?LinkId=550986 for Windows 10

 

Relation between num_tokens and cluster extend limitations

2020-07-13 Thread onmstester onmstester
Hi, 



I'm using allocate_tokens_for_keyspace and num_tokens=32, and I want to extend 
the size of some clusters.

I read in articles that with num_tokens=4, one should add 25% of the cluster 
size for the cluster to become balanced again.



1. For example, with num_tokens=4 and 16 existing nodes, should I add 4 nodes 
to be balanced again, and would my cluster be in a dangerous, unbalanced state 
until I add the 4th node?

Or should I add all 4 nodes at once (I don't know how, because I should wait 
for each node to stream in multiple TBs)?



2. With num_tokens=32, for these different cluster sizes, how many nodes should 
I add to be balanced again?

a. cluster size = 16

b. cluster size = 32

c. cluster size = 70



3. I cannot understand why Cassandra could not keep the cluster balanced when 
you add/remove a single node; how does the mechanism work?

4. Is the limitation the same when shrinking?



Thank you in advance

Re: Running Large Clusters in Production

2020-07-10 Thread onmstester onmstester
Yes, you should handle the routing logic at the app level.

I wish there was another level of sharding (above dc and rack), i.e. "cluster", 
to distribute data over multiple clusters! But I don't think any other database 
does such a thing for you either.

Another problem with a big cluster is the huge number of threads on each node, 
which is (CLUSTER_SIZE - 1) * (3 incoming threads + 3 outgoing); even for 100 
nodes that is roughly 600 threads. I wonder how some papers reported linear 
scalability for Cassandra even with >300 nodes (such as Netflix in 2011); I 
mean, shouldn't the overhead of the increasing number of threads on each node 
slow down the linear scalability?

Sent using https://www.zoho.com/mail/
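
As a quick check of the arithmetic in the message above (n is the cluster size):

    # 3 incoming + 3 outgoing messaging threads per peer
    python -c 'n = 100; print((n - 1) * (3 + 3))'   # -> 594, i.e. roughly 600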




 On Sat, 11 Jul 2020 06:18:33 +0430 Sergio  
wrote 


Sorry for the dumb question:

When we refer to 1000 nodes divided into 10 clusters (shards), we would have 
100 nodes per cluster.

A shard is not intended as a datacenter but would be a cluster in itself that 
doesn't talk to the other ones, so there should be some routing logic at the 
application level to route requests to the correct cluster?

Is this the recommended approach?



Thanks 







On Fri, Jul 10, 2020, 4:06 PM Jon Haddad  wrote:





I worked on a handful of large clusters (> 200 nodes) using vnodes, and there 
were some serious issues with both performance and availability.  We had to put 
in a LOT of work to fix the problems.

I agree with Jeff - it's way better to manage multiple clusters than a really 
large one.





On Fri, Jul 10, 2020 at 2:49 PM Jeff Jirsa  wrote:

1000 instances are fine if you're not using vnodes.

I'm not sure what the limit is if you're using vnodes. 


If you might get to 1000, shard early before you get there. Running 8x100 host 
clusters will be easier than one 800 host cluster.




On Fri, Jul 10, 2020 at 2:19 PM Isaac Reath (BLOOMBERG/ 919 3RD A) 
 wrote:

Hi All,

I’m currently dealing with a use case that is running on around 200 nodes, due 
to growth of their product as well as onboarding additional data sources, we 
are looking at having to expand that to around 700 nodes, and potentially 
beyond to 1000+. To that end I have a couple of questions:



1)  For those who have experienced managing clusters at that scale, what types 
of operational challenges have you run into that you might not see when 
operating 100 node clusters? A couple that come to mind are version (especially 
major version) upgrades become a lot more risky as it no longer becomes 
feasible to do a blue / green style deployment of the database and backup & 
restore operations seem far more error prone as well for the same reason 
(having to do an in-place restore instead of being able to spin up a new 
cluster to restore to).



2) Is there a cluster size beyond which sharding across multiple clusters 
becomes the recommended approach?



Thanks,

Isaac

Cassandra crashes when using offheap_objects for memtable_allocation_type

2020-06-02 Thread onmstester onmstester
I just changed these properties to increase flushed file size (decrease number 
of compactions):

memtable_allocation_type from heap_buffers to offheap_objects

memtable_offheap_space_in_mb: from default (2048) to 8192


Using default values for the other memtable/compaction/commitlog configurations.
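
For clarity, the two non-default settings described above, as they would appear 
in cassandra.yaml (the file path is an assumption):

    grep -E '^(memtable_allocation_type|memtable_offheap_space_in_mb)' /etc/cassandra/cassandra.yaml
    # memtable_allocation_type: offheap_objects
    # memtable_offheap_space_in_mb: 8192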


After a few hours, some nodes stopped doing any mutations (dropped mutations 
increased) and pending flushes also increased; they were still up and running, 
and there was only a single CPU core at 100% usage (the other cores were at 
0%). Other nodes in the cluster marked the node as DN. I could not access port 
7199 and also could not create a thread dump, even with jstack -F.



Restarting the Cassandra service fixes the problem, but after a while some 
other node would go DN.



Am I missing some configuration? What should I change in the Cassandra default 
configuration to maximize write throughput on a single node/cluster in a 
write-heavy scenario, for the data model below?

The data model is a single table:

  create table test(
      partition_key   text,
      clustering_key  text,
      rows            set<text>,
      primary key ((partition_key, clustering_key)))






vCPU: 12

Memory: 32GB

Node data size: 2TB
Apache cassandra 3.11.2

JVM heap size: 16GB, CMS, 1GB newgen



Sent using https://www.zoho.com/mail/

Fwd: Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-03 Thread onmstester onmstester
Thank you so much



Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Max C. 
To: 
Date: Tue, 04 Feb 2020 08:37:21 +0330
Subject: Re: [Discuss] num_tokens default in Cassandra 4.0
 Forwarded message 



Let’s say you have a 6 node cluster, with RF=3, and no vnodes.  In that case 
each piece of data is stored as follows:




N1: N2 N3

N2: N3 N4

N3: N4 N5

N4: N5 N6

N5: N6 N1

N6: N1 N2



With this setup, there are some circumstances where you could lose 2 nodes (ex: 
N1 & N4) and still be able to maintain CL=quorum.  If your cluster is very 
large, then you could lose even more — and that’s a good thing, because if you 
have hundreds/thousands of nodes then you don’t want the world to come tumbling 
down if  > 1 node is down.  Or maybe you want to upgrade the OS on your nodes, 
and want to (with very careful planning!) do it by taking down more than 1 node 
at a time.



… but if you have a large number of vnodes, then a given node will share a 
small segment of data with LOTS of other nodes, which destroys this property.  
The more vnodes, the less likely you’re able to handle > 1 node down.



For example, see this diagram in the Datastax docs —



https://docs.datastax.com/en/dse/5.1/dse-arch/datastax_enterprise/dbArch/archDataDistributeVnodesUsing.html#Distributingdatausingvnodes



In that bottom picture, you can’t knock out 2 nodes and still maintain 
CL=quorum.  Ex:  If you knock out node 1 & 4, then ranges B & L would no longer 
meet CL=quorum;  but you can do that in the top diagram, since there are no 
ranges shared between node 1 & 4.



Hope that helps.



- Max





On Feb 3, 2020, at 8:39 pm, onmstester onmstester 
<mailto:onmstes...@zoho.com.INVALID> wrote:



Sorry if its trivial, but i do not understand how num_tokens affects 
availability, with RF=3, CLW,CLR=quorum, the cluster could tolerate to lost at 
most one node and all of the tokens assigned to that node would be also 
assigned to two other nodes no matter what num_tokens is, right?



Sent using https://www.zoho.com/mail/






 Forwarded message 
From: Jon Haddad <mailto:j...@jonhaddad.com>
To: <mailto:d...@cassandra.apache.org>
Date: Tue, 04 Feb 2020 01:15:21 +0330
Subject: Re: [Discuss] num_tokens default in Cassandra 4.0
 Forwarded message 



I think it's a good idea to take a step back and get a high level view of 
the problem we're trying to solve. 
 
First, high token counts result in decreased availability as each node has 
data overlap with with more nodes in the cluster.  Specifically, a node can 
share data with RF-1 * 2 * num_tokens.  So a 256 token cluster at RF=3 is 
going to almost always share data with every other node in the cluster that 
isn't in the same rack, unless you're doing something wild like using more 
than a thousand nodes in a cluster.  We advertise 
 
With 16 tokens, that is vastly improved, but you still have up to 64 nodes 
each node needs to query against, so you're again, hitting every node 
unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs).  I 
wouldn't use 16 here, and I doubt any of you would either.  I've advocated 
for 4 tokens because you'd have overlap with only 16 nodes, which works 
well for small clusters as well as large.  Assuming I was creating a new 
cluster for myself (in a hypothetical brand new application I'm building) I 
would put this in production.  I have worked with several teams where I 
helped them put 4 token clusters in prod and it has worked very well.  We 
didn't see any wild imbalance issues. 
 
As Mick's pointed out, our current method of using random token assignment 
for the default number is problematic for 4 tokens.  I fully agree with 
this, and I think if we were to try to use 4 tokens, we'd want to address 
this in tandem.  We can discuss how to better allocate tokens by default 
(something more predictable than random), but I'd like to avoid the 
specifics of that for the sake of this email. 
 
To Alex's point, repairs are problematic with lower token counts due to 
over streaming.  I think this is a pretty serious issue and we'd have to 
address it before going all the way down to 4.  This, in my opinion, is a 
more complex problem to solve and I think trying to fix it here could make 
shipping 4.0 take even longer, something none of us want. 
 
For the sake of shipping 4.0 without adding extra overhead and time, I'm ok 
with moving to 16 tokens, and in the process adding extensive documentation 
outlining what we recommend for production use.  I think we should also try 
to figure out something better than random as the default to fix the data 
imbalance issues.  I've got a few ideas here I've been noodling on. 
 
As long as folks are fine with potentially changing the default again in C* 
5.0 (after another discussion / debate), 16 is enough of an improvement 
that I'm O

Fwd: Re: [Discuss] num_tokens default in Cassandra 4.0

2020-02-03 Thread onmstester onmstester
Sorry if it's trivial, but I do not understand how num_tokens affects 
availability. With RF=3 and CL=QUORUM for both writes and reads, the cluster 
can tolerate the loss of at most one node, and all of the tokens assigned to 
that node are also assigned to two other nodes no matter what num_tokens is, 
right?


Sent using https://www.zoho.com/mail/




 Forwarded message 
From: Jon Haddad 
To: 
Date: Tue, 04 Feb 2020 01:15:21 +0330
Subject: Re: [Discuss] num_tokens default in Cassandra 4.0
 Forwarded message 


I think it's a good idea to take a step back and get a high level view of 
the problem we're trying to solve. 
 
First, high token counts result in decreased availability as each node has 
data overlap with with more nodes in the cluster.  Specifically, a node can 
share data with RF-1 * 2 * num_tokens.  So a 256 token cluster at RF=3 is 
going to almost always share data with every other node in the cluster that 
isn't in the same rack, unless you're doing something wild like using more 
than a thousand nodes in a cluster.  We advertise 
 
With 16 tokens, that is vastly improved, but you still have up to 64 nodes 
each node needs to query against, so you're again, hitting every node 
unless you go above ~96 nodes in the cluster (assuming 3 racks / AZs).  I 
wouldn't use 16 here, and I doubt any of you would either.  I've advocated 
for 4 tokens because you'd have overlap with only 16 nodes, which works 
well for small clusters as well as large.  Assuming I was creating a new 
cluster for myself (in a hypothetical brand new application I'm building) I 
would put this in production.  I have worked with several teams where I 
helped them put 4 token clusters in prod and it has worked very well.  We 
didn't see any wild imbalance issues. 
 
As Mick's pointed out, our current method of using random token assignment 
for the default number is problematic for 4 tokens.  I fully agree with 
this, and I think if we were to try to use 4 tokens, we'd want to address 
this in tandem.  We can discuss how to better allocate tokens by default 
(something more predictable than random), but I'd like to avoid the 
specifics of that for the sake of this email. 
 
To Alex's point, repairs are problematic with lower token counts due to 
over streaming.  I think this is a pretty serious issue and we'd have to 
address it before going all the way down to 4.  This, in my opinion, is a 
more complex problem to solve and I think trying to fix it here could make 
shipping 4.0 take even longer, something none of us want. 
 
For the sake of shipping 4.0 without adding extra overhead and time, I'm ok 
with moving to 16 tokens, and in the process adding extensive documentation 
outlining what we recommend for production use.  I think we should also try 
to figure out something better than random as the default to fix the data 
imbalance issues.  I've got a few ideas here I've been noodling on. 
 
As long as folks are fine with potentially changing the default again in C* 
5.0 (after another discussion / debate), 16 is enough of an improvement 
that I'm OK with the change, and willing to author the docs to help people 
set up their first cluster.  For folks that go into production with the 
defaults, we're at least not setting them up for total failure once their 
clusters get large like we are now. 
 
In future versions, we'll probably want to address the issue of data 
imbalance by building something in that shifts individual tokens around.  I 
don't think we should try to do this in 4.0 either. 
 
Jon 
 
 
 
On Fri, Jan 31, 2020 at 2:04 PM Jeremy Hanna 
 
wrote: 
 
> I think Mick and Anthony make some valid operational and skew points for 
> smaller/starting clusters with 4 num_tokens. There’s an arbitrary line 
> between small and large clusters but I think most would agree that most 
> clusters are on the small to medium side. (A small nuance is afaict the 
> probabilities have to do with quorum on a full token range, ie it has to do 
> with the size of a datacenter not the full cluster 
> 
> As I read this discussion I’m personally more inclined to go with 16 for 
> now. It’s true that if we could fix the skew and topology gotchas for those 
> starting things up, 4 would be ideal from an availability perspective. 
> However we’re still in the brainstorming stage for how to address those 
> challenges. I think we should create tickets for those issues and go with 
> 16 for 4.0. 
> 
> This is about an out of the box experience. It balances availability, 
> operations (such as skew and general bootstrap friendliness and 
> streaming/repair), and cluster sizing. Balancing all of those, I think for 
> now I’m more comfortable with 16 as the default with docs on considerations 
> and tickets to unblock 4 as the default for all users. 
> 
> >>> On Feb 1, 2020, at 6:30 AM, Jeff Jirsa  

Re: bug in cluster key push down

2020-01-12 Thread onmstester onmstester
> It’s probably just a logging / visibility problem, but we should confirm

I think it is.

Because with tracing on, cqlsh logs "read 1 live rows ..." for the query with 
both clustering keys restricted, while the whole partition (with no clustering 
key restriction) has 12 live rows, so I suppose the clustering key restrictions 
are being pushed down to the storage engine.



Thanks Jeff
Sent using https://www.zoho.com/mail/






 On Mon, 13 Jan 2020 08:38:44 +0330 onmstester onmstester 
<mailto:onmstes...@zoho.com.INVALID> wrote 



Done.

https://issues.apache.org/jira/browse/CASSANDRA-15500



Sent using https://www.zoho.com/mail/






 On Sun, 12 Jan 2020 19:22:33 +0330 Jeff Jirsa <mailto:jji...@gmail.com> 
wrote 












Can you open a jira so someone can investigate ? It’s probably just a logging / 
visibility problem, but we should confirm 



Sent from my iPhone



On Jan 12, 2020, at 6:04 AM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:





Using Apache Cassandra 3.11.2, defined a table like this:



create table my_table(

                   partition text,

           clustering1 int,

      clustering2 text,

      data set,

    primary key (partition, clustering1, clustering2))



and configured slow queries threshold to 1ms in yaml to see how queries passed 
to cassandra. Query below:



select * from my_table where partition='a' and clustering1= 1 and 
clustering2='b'



would be like this in debug.log of cassandra:



select * from my_table where partition='a' LIMIT 100>  (it means that the two 
cluster key restriction did not push down to storage engine and the whole 
partition been retrieved)



but this query:



select * from my_table where partition='a' and clustering1= 1



would be 



select * from my_table where partition='a' and clustering1= 1 LIMIT 100> 
(single cluster key been pushed down to storage engine)





So it seems to me that, we could not restrict multiple clustering keys in 
select because it would retrieve the whole partition ?!

Sent using https://www.zoho.com/mail/

Re: bug in cluster key push down

2020-01-12 Thread onmstester onmstester
Done.

https://issues.apache.org/jira/browse/CASSANDRA-15500


Sent using https://www.zoho.com/mail/




 On Sun, 12 Jan 2020 19:22:33 +0330 Jeff Jirsa  wrote 


Can you open a jira so someone can investigate ? It’s probably just a logging / 
visibility problem, but we should confirm 

Sent from my iPhone


On Jan 12, 2020, at 6:04 AM, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:



Using Apache Cassandra 3.11.2, defined a table like this:



create table my_table(

                   partition text,

           clustering1 int,

      clustering2 text,

      data set,

    primary key (partition, clustering1, clustering2))



and configured slow queries threshold to 1ms in yaml to see how queries passed 
to cassandra. Query below:



select * from my_table where partition='a' and clustering1= 1 and 
clustering2='b'



would be like this in debug.log of cassandra:



select * from my_table where partition='a' LIMIT 100>  (it means that the two 
cluster key restriction did not push down to storage engine and the whole 
partition been retrieved)



but this query:



select * from my_table where partition='a' and clustering1= 1



would be 



select * from my_table where partition='a' and clustering1= 1 LIMIT 100> 
(single cluster key been pushed down to storage engine)





So it seems to me that, we could not restrict multiple clustering keys in 
select because it would retrieve the whole partition ?!

Sent using https://www.zoho.com/mail/

bug in cluster key push down

2020-01-12 Thread onmstester onmstester
Using Apache Cassandra 3.11.2, defined a table like this:



create table my_table(
    partition    text,
    clustering1  int,
    clustering2  text,
    data         set<text>,
    primary key (partition, clustering1, clustering2))



and configured the slow-query threshold to 1 ms in the yaml, to see how queries 
are passed to Cassandra. The query below:



select * from my_table where partition='a' and clustering1= 1 and 
clustering2='b'



would be like this in debug.log of cassandra:



select * from my_table where partition='a' LIMIT 100>  (it means that the two 
clustering key restrictions were not pushed down to the storage engine and the 
whole partition was retrieved)



but this query:



select * from my_table where partition='a' and clustering1= 1



would be 



select * from my_table where partition='a' and clustering1= 1 LIMIT 100> 
(the single clustering key restriction was pushed down to the storage engine)





So it seems to me that we cannot restrict multiple clustering keys in a select, 
because it would retrieve the whole partition?!

Sent using https://www.zoho.com/mail/
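
For reference, the yaml knob used above to make every query show up in 
debug.log is the slow-query threshold (the file path is an assumption):

    grep slow_query_log_timeout_in_ms /etc/cassandra/cassandra.yaml
    # slow_query_log_timeout_in_ms: 1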

cassandra collection best practices and performance

2020-01-07 Thread onmstester onmstester
What is the sweet spot for set and list item counts (in DataStax's documents, 
the max is 2 billion)?

How does write and read performance compare for set vs list vs simple rows in 
a partition?

Thanks in advance

Cluster of small clusters

2019-11-16 Thread onmstester onmstester
Each Cassandra node creates 6 separate threads for incoming and outgoing 
streams to every other node in the cluster. So with big clusters, for example 
100 nodes, there would be roughly 600 such threads running in each Cassandra 
process, which would cause performance problems, so it is better to have 
multiple small clusters for one application.

Suppose that the application cannot be divided across different clusters, and 
DCs are only limited to RF so they won't help much in this case. Is there any 
workaround for this scenario, to have a "cluster of small clusters" of 
Cassandra and still be able to route requests by partition-key token among 
multiple clusters automatically?

I wonder whether this scenario is also a challenge for companies like Apple 
(which has hundreds of thousands of nodes among thousands of clusters)?



This is more of an academic research question about really big data, so any 
real-case experience or suggestion would be appreciated.



Thanks in advance

Sent using https://www.zoho.com/mail/

Re: Cassandra.Link Knowledge Base - v. 0.4

2019-07-21 Thread onmstester onmstester
Thank you all!


Sent using https://www.zoho.com/mail/




 On Sat, 20 Jul 2019 16:13:29 +0430 Rahul Singh 
 wrote 


Hey Cassandra community , 

Thanks for all the feedback in the past on my cassandra knowledge base project. 
Without the feedback cycle it’s not really for the community. 



V. 0.1 - Awesome Cassandra  readme.me 

https://anant.github.io/awesome-cassandra 



Hundreds of Cassandra articles tools etc. organized in a table of contents. 
This currently still maintains the official Cassandra.Link redirection. 



V. 0.2 - 600+ Organized links in Wallabag exposed as a web interface 

http://leaves.anant.us/leaves/#!/?tag=cassandra



V. 0.3 - those 600+ Indexed with some Natural language / entity extraction jazz 
to machine taxonomize





V. 0.4 - 10-15 Blog feeds + organized links as a statically generates site. 
(Search not working but will be. )  



https://cassandra.netlify.com/



This last version was made possible by a few of our team members in Washington 
DC and in Delhi : Mohd Danish (Delhi) , Tanaka Mapondera (DC), Rishi Nair 
(DC/VA) and the numerous contributions of the Cassandra community in the form 
of tools, articles, and videos. 



We’d love to get folks to check it out and send critical feedback — directly to 
me or do a pull request on the awesome-cassandra repo (this is just for the 
readme with Cassandra knowledge organized in a Table or Contents. ) 



Best, 





mailto:rahul.xavier.si...@gmail.com http://cassandra.link

Re: How to set up a cluster with allocate_tokens_for_keyspace?

2019-05-05 Thread onmstester onmstester
The problem is that I have defined too many racks in my cluster (because I have 
multiple Cassandra nodes on a single server, I defined each physical server as 
a separate rack), and because I had not heard of any "one seed per rack" rule 
before the TLP article (actually the only rule about seed nodes I had in mind 
was: "3-4 seed nodes in the cluster is enough, more is unnecessary and 
non-performant"), I always set up my clusters with 3-4 seed nodes.



I already have a cluster set up with the wrong mechanism (just one seed node 
with initial_token, and then the other nodes simply bootstrapped one after 
another), and it seems to be working: it's almost balanced, and when I unplug a 
whole rack, writes and reads still work with no errors (using CL=ONE).

So what would be the problem? Is it catastrophic not to use manual token 
assignment on every seed node of every rack?

I assume that when I define racks, whatever happens, Cassandra never puts two 
copies of my data in a single rack? (Right now that is my main concern, because 
I'm OK with my cluster's balanced load.)


Sent using https://www.zoho.com/mail/








 On Mon, 06 May 2019 07:17:14 +0430 Anthony Grasso 
 wrote 



Hi



If you are planning on setting up a new cluster with 
allocate_tokens_for_keyspace, then yes, you will need one seed node per rack. 
As Jon mentioned in a previous email, you must manually specify the token range 
for each seed node. This can be done using the initial_token setting.



The article you are referring to 
(https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html)
 includes python code which calculates the token ranges for each of the seed 
nodes. When calling that python code, you must specify the vnodes - number of 
token per node and the number of racks.



Regards,

Anthony
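
Along the same lines as the python one-liner quoted further down, a rough 
sketch of generating initial_token values for one seed per rack (this is an 
assumption of mine, not the exact script from the article; num_tokens=4 and 3 
racks here):

    # set r = 0, 1, 2 for each rack's seed and paste the printed value into
    # that seed's initial_token in cassandra.yaml
    python -c "r = 0; nt = 4; racks = 3; print(','.join(str((2**64 // nt) * i + r * (2**64 // (nt * racks)) - 2**63) for i in range(nt)))"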







On Sat, 4 May 2019 at 19:14, onmstester onmstester 
<mailto:onmstes...@zoho.com.invalid> wrote:







I just read this article by tlp:

https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html

 

Noticed that:

>>We will need to set the tokens for the seed nodes in each rack manually. This 
>>is to prevent each node from randomly calculating its own token ranges



 But until now, i was using this recommendation to setup a new cluster:

>>

You'll want to set them explicitly using: python -c 'print( [str(((2**64 / 4) * 
i) - 2**63) for i in range(4)])'


After you fire up the first seed, create a keyspace using RF=3 (or whatever 
you're planning on using) and set allocate_tokens_for_keyspace to that keyspace 
in your config, and join the rest of the nodes. That gives even
distribution.

I've defined plenty of racks in my cluster (and only 3 seed nodes), should i 
have a seed node per rack and use initial_token for all of the seed nodes or 
just one seed node with inital_token would be ok?

Best Regards

Fwd: Re: How to set up a cluster with allocate_tokens_for_keyspace?

2019-05-04 Thread onmstester onmstester
So do you mean that setting tokens for only one node (one of the seed nodes) is 
good enough?

I cannot see any problem with this mechanism (only one manual token assignment 
at cluster set-up), but the article was also trying to set up a balanced 
cluster, and the way it insists on manual token assignment for multiple seed 
nodes confused me.



Sent using https://www.zoho.com/mail/






 Forwarded message 

From: Jon Haddad 

To: 

Date: Sat, 04 May 2019 22:10:39 +0430

Subject: Re: How to set up a cluster with allocate_tokens_for_keyspace?

 Forwarded message 




That line is only relevant for when you're starting your cluster and 

you need to define your initial tokens in a non-random way.  Random 

token distribution doesn't work very well when you only use 4 tokens. 

 

Once you get the cluster set up you don't need to specify tokens 

anymore, you can just use allocate_tokens_for_keyspace. 

 

On Sat, May 4, 2019 at 2:14 AM onmstester onmstester 

<mailto:onmstes...@zoho.com.invalid> wrote: 

> 

> I just read this article by tlp: 

> https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
>  

> 

> Noticed that: 

> >>We will need to set the tokens for the seed nodes in each rack manually. 
> >>This is to prevent each node from randomly calculating its own token ranges 

> 

>  But until now, i was using this recommendation to setup a new cluster: 

> >> 

> 

> You'll want to set them explicitly using: python -c 'print( [str(((2**64 / 4) 
> * i) - 2**63) for i in range(4)])' 

> 

> 

> After you fire up the first seed, create a keyspace using RF=3 (or whatever 
> you're planning on using) and set allocate_tokens_for_keyspace to that 
> keyspace in your config, and join the rest of the nodes. That gives even 

> distribution. 

> 

> I've defined plenty of racks in my cluster (and only 3 seed nodes), should i 
> have a seed node per rack and use initial_token for all of the seed nodes or 
> just one seed node with inital_token would be ok? 

> 

> Best Regards 

> 

> 

 

- 

To unsubscribe, e-mail: mailto:user-unsubscr...@cassandra.apache.org 

For additional commands, e-mail: mailto:user-h...@cassandra.apache.org

How to set up a cluster with allocate_tokens_for_keyspace?

2019-05-04 Thread onmstester onmstester
I just read this article by tlp:

https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html

 

Noticed that:

>>We will need to set the tokens for the seed nodes in each rack manually. This 
>>is to prevent each node from randomly calculating its own token ranges



 But until now, i was using this recommendation to setup a new cluster:

>>

You'll want to set them explicitly using: python -c 'print( [str(((2**64 / 4) * 
i) - 2**63) for i in range(4)])'


After you fire up the first seed, create a keyspace using RF=3 (or whatever 
you're planning on using) and set allocate_tokens_for_keyspace to that keyspace 
in your config, and join the rest of the nodes. That gives even
distribution.

I've defined plenty of racks in my cluster (and only 3 seed nodes); should I 
have a seed node per rack and use initial_token for all of the seed nodes, or 
would just one seed node with initial_token be OK?
Best Regards

Re: when the "delete statement" would be deleted?

2019-04-24 Thread onmstester onmstester
Found the answer: it is deleted after gc_grace.

I just decreased gc_grace, ran a compaction, and the "marked_deleted" 
partitions were purged from the sstable.
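
A minimal sketch of those two steps (keyspace/table names and the temporary 
1-hour value are placeholders; lowering gc_grace carries the usual zombie-data 
caveats discussed elsewhere in this archive):

    cqlsh -e "ALTER TABLE ks.my_table WITH gc_grace_seconds = 3600;"
    nodetool compact ks my_table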


Sent using https://www.zoho.com/mail/






 On Wed, 24 Apr 2019 14:15:33 +0430 onmstester onmstester 
 wrote 



Just deleted multiple partitions from one of my tables, dumping sstables shows 
that the data successfully deleted, but the 'marked_deleted' rows for each of 
partitions still exists on sstable and allocates storage. 

Is there any way to get rid of these delete statements storage overhead 
(everything be deleted after final compactions, even the delete statements)?

Sent using https://www.zoho.com/mail/

when the "delete statement" would be deleted?

2019-04-24 Thread onmstester onmstester
I just deleted multiple partitions from one of my tables. Dumping the sstables 
shows that the data was successfully deleted, but the 'marked_deleted' row for 
each partition still exists in the sstable and takes up storage.

Is there any way to get rid of the storage overhead of these delete markers 
(i.e. have everything, even the delete markers, removed after the final 
compactions)?

Sent using https://www.zoho.com/mail/

Re: gc_grace config for time serie database

2019-04-17 Thread onmstester onmstester
I do not use a table default TTL (every row has its own TTL), and no updates 
occur to the rows.

I suppose that (because of the immutable nature of everything in Cassandra) 
Cassandra keeps only the insertion timestamp plus the original TTL, and 
computes the remaining TTL of a row from these two and the current system 
timestamp whenever needed (when you select ttl() or when compaction occurs).

So there should be something like this attached to every row: "this row was 
inserted at 4/17/2019 12:20 PM and should be deleted in 2 months", so whatever 
happens to the row replicas, my intention of removing it on 6/17 should not 
change!



Would you suggest that my idea of "gc_grace = max_hint = 3 hours" for a 
time-series DB is not reasonable?


Sent using https://www.zoho.com/mail/






 On Wed, 17 Apr 2019 17:13:02 +0430 Stefan Miklosovic 
 wrote 



The TTL value decreases every second, and it is set back to the original TTL 

value after an update occurs on that row (see the example below). Does it not 

logically imply that if a node is down for some time while updates are 

occurring on the live nodes, and hinted handoffs are saved for three hours and 

then stop, your data on the other nodes will not be deleted (as TTLs are reset 

on every update and the countdown starts again, which is correct), but it will 

be deleted on the node which was down, because it did not receive the updates? 

So if you query that node, the data will not be there, but it should be. 

 

On the other hand: a node was down, the data was TTLed on the healthy nodes and 

a tombstone was created, then you start the node which was down and, as its 

copy counts down, you hit that node with an update. So there is no tombstone on 

the previously dead node but there are tombstones on the healthy ones, and if 

you delete tombstones after 3 hours, the previously dead node will never get 

that information, and your data might actually end up being resurrected, as it 

would be replicated back to the always-healthy nodes as part of a repair. 

 

Do you see some flaw in my reasoning? 

 

cassandra@cqlsh> DESCRIBE TABLE test.test; 

 

CREATE TABLE test.test ( 

 id uuid PRIMARY KEY, 

 value text 

) WITH bloom_filter_fp_chance = 0.6 

 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'} 

 AND comment = '' 

 AND compaction = {'class': 

'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 

'max_threshold': '32', 'min_threshold': '4'} 

 AND compression = {'chunk_length_in_kb': '64', 'class': 

'org.apache.cassandra.io.compress.LZ4Compressor'} 

 AND crc_check_chance = 1.0 

 AND dclocal_read_repair_chance = 0.1 

 AND default_time_to_live = 60 

 AND gc_grace_seconds = 864000 

 AND max_index_interval = 2048 

 AND memtable_flush_period_in_ms = 0 

 AND min_index_interval = 128 

 AND read_repair_chance = 0.0 

 AND speculative_retry = '99PERCENTILE'; 

 

 

cassandra@cqlsh> select ttl(value) from test.test where id = 

4f860bf0-d793-4408-8330-a809c6cf6375; 

 

 ttl(value) 

 

 25 

 

(1 rows) 

cassandra@cqlsh> UPDATE test.test SET value = 'c' WHERE  id = 

4f860bf0-d793-4408-8330-a809c6cf6375; 

cassandra@cqlsh> select ttl(value) from test.test where id = 

4f860bf0-d793-4408-8330-a809c6cf6375; 

 

 ttl(value) 

 

 59 

 

(1 rows) 

cassandra@cqlsh> select * from test.test  ; 

 

 id   | value 

--+--- 

 4f860bf0-d793-4408-8330-a809c6cf6375 | c 

 

 

On Wed, 17 Apr 2019 at 19:18, fald 1970  wrote: 

> 

> 

> 

> Hi, 

> 

> According to these Facts: 

> 1. If a node is down for longer than max_hint_window_in_ms (3 hours by 
> default), the coordinator stops writing new hints. 

> 2. The main purpose of gc_grace property is to prevent Zombie data and also 
> it determines for how long the coordinator should keep hinted files 

> 

> When we use Cassandra for Time series data which: 

> A) Every row of data has TTL and there would be no explicit delete so not so 
> much worried about zombies 

> B) At every minute there should be hundredrs of write requets to each node, 
> so if one of the node was down for longer than max_hint_window_in_ms, we 
> should run manual repair on that node, so anyway stored hints on the 
> coordinator won't be necessary. 

> 

> So Finally the question, is this a good idea to set gc_grace equal to 
> max_hint_window_in_ms (/1000 to convert to seconds), 

> for example set them both to 3 hours (why should keep the tombstones for 10 
> days when they won't be needed at all)? 

> 

> Best Regards 

> Federica Albertini 

 


can i delete a sstable with Estimated droppable tombstones > 1, manually?

2019-03-19 Thread onmstester onmstester
Running:
SSTablemetadata /THE_KEYSPACE_DIR/mc-1421-big-Data.db



result was:

Estimated droppable tombstones: 1.2



Having STCS and 80% data disk usage (not enough free space for a normal
compaction), is it OK to just: 1. stop Cassandra, 2. delete mc-1421*, and
3. start Cassandra again?
Sent using https://www.zoho.com/mail/
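
If you need to check more than one file before deciding, here is a rough sketch
that runs sstablemetadata over every Data.db file in a table directory and
prints the droppable-tombstone estimate. It assumes sstablemetadata is on the
PATH and prints the "Estimated droppable tombstones" line shown above; the
directory path is a placeholder:

import glob
import re
import subprocess

TABLE_DIR = '/THE_KEYSPACE_DIR'      # table data directory, placeholder

for data_file in sorted(glob.glob(TABLE_DIR + '/*-Data.db')):
    out = subprocess.check_output(['sstablemetadata', data_file],
                                  universal_newlines=True)
    m = re.search(r'Estimated droppable tombstones:\s*([0-9.]+)', out)
    if m:
        print('%-60s %s' % (data_file, m.group(1)))
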

Re: removenode force vs assasinate

2019-03-11 Thread onmstester onmstester
The only option to stream the decommissioned node's data is to run "nodetool
decommission" on that node (while Cassandra is still running on it).

removenode only streams data from the node's replicas, so any data that was
stored only on the decommissioned node would be lost.

You should monitor the streaming status with "nodetool netstats" on all nodes to
see whether it is making progress. If it gets stuck on any node, check that
node's logs to find out the reason.

Normally i try removenode 3 times; if none of them succeeds, then i run
"nodetool assassinate".





Sent using https://www.zoho.com/mail/
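
A rough sketch of watching that streaming progress from one place with
"nodetool netstats", assuming passwordless ssh to every node and nodetool on
each node's PATH; the host list and the line filter are illustrative
assumptions:

import subprocess
import time

NODES = ['10.0.0.1', '10.0.0.2', '10.0.0.3']   # placeholder host list

while True:
    for host in NODES:
        out = subprocess.check_output(
            ['ssh', host, 'nodetool', 'netstats'], universal_newlines=True)
        # keep only the lines that look like active streams / progress
        streaming = [l for l in out.splitlines()
                     if 'Receiving' in l or 'Sending' in l or '%' in l]
        print(host)
        print('\n'.join(streaming) or '  no active streams')
    time.sleep(60)
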






 On Mon, 11 Mar 2019 16:02:19 +0330 Ahmed Eljami  
wrote 



Thx onmstester,



I think that removenode doesn't stream the node's own data: refer to the TLP
blog: http://thelastpickle.com/blog/2018/09/18/assassinate.html

"Will NOT stream any of the decommissioned node’s data to the new replicas."



Anyway, I have already launched a removenode, but the node still appears as DL
after 72 hours.



So I want to know whether I should run removenode with force once again, or go
directly to assassinate, and what the difference between them is.



THx

Re: removenode force vs assasinate

2019-03-11 Thread onmstester onmstester
You should first try removenode, which triggers streaming across the cluster; if
removenode fails or gets stuck, assassinate is the last resort.



Sent using https://www.zoho.com/mail/






 On Mon, 11 Mar 2019 14:27:13 +0330 Ahmed Eljami  
wrote 



Hello,



Can someone explain to me the difference between removenode force and
assassinate in a case where a node stays in status DL?



Thx

forgot to run nodetool cleanup

2019-02-12 Thread onmstester onmstester
Hi,



I should have run cleanup after adding a few nodes to my cluster about 2 months
ago; the TTL is 6 months. What happens now? Should i worry about anything
catastrophic?

Should i run the cleanup now?



Thanks in advance


Sent using https://www.zoho.com/mail/
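
If you do run it now, a rough sketch of running cleanup one node at a time, so
that only one node is rewriting its sstables at any moment. It assumes
passwordless ssh to each node and nodetool on each node's PATH; the host list is
a placeholder:

import subprocess

NODES = ['10.0.0.1', '10.0.0.2', '10.0.0.3']   # placeholder host list

for host in NODES:
    print('cleaning up', host)
    # check_call blocks until cleanup on this node finishes (or fails),
    # so the next node only starts after the previous one is done
    subprocess.check_call(['ssh', host, 'nodetool', 'cleanup'])
    print('done', host)
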

Fwd: Question about allocate_tokens_for_keyspace

2019-01-28 Thread onmstester onmstester
You can only set one keyspace as the value of allocate_tokens_for_keyspace; it
specifies the keyspace whose replication the algorithm optimizes for. So as long
as your keyspaces use similar replication strategies and replication factors,
you should not need to worry about this.



for more detail read this doc:

https://www.datastax.com/dev/blog/token-allocation-algorithm



Sent using https://www.zoho.com/mail/






 Forwarded message 

>From : Ahmed Eljami 

To : 

Date : Mon, 28 Jan 2019 12:14:24 +0330

Subject : Question about allocate_tokens_for_keyspace

 Forwarded message 




Hi Folks,



I'm about to configure a new cluster with num_token = 32 and using the new 
token allocation.



For the first keyspace, I understood that it will be used to start my cluster:
allocate_tokens_for_keyspace = my_first_ks.



My question is about the rest of the keyspaces: will they take the same
configuration as the first keyspace, or do I have to add them to cassandra.yaml
and restart my cluster each time?







Thanks

slow commitlog sync

2018-12-23 Thread onmstester onmstester
Hi, I'm seeing a lot of logs like this on all of my nodes (every 5 minutes):

WARN  [PERIODIC-COMMIT-LOG-SYNCER] 2018-05-23 08:59:19,075 NoSpamLogger.java:94
- Out of 50 commit log syncs over the past 300s with average duration of
300.00ms, 30 have exceeded the configured commit interval by an average of
400.00ms

Should i worry about it? If not, which parameter should I tune? Using C* 3.11.2
and a separate disk for the commitlog (7200 rpm). Best Regards. Sent using Zoho
Mail

Fwd: Cassandra does launch since computer was accidentally unplugged

2018-12-08 Thread onmstester onmstester
Delete the file C:\Program
Files\DataStax-DDC\data\commitlog\CommitLog-6-1542650688953.log and restart
Cassandra. It's possible that you will lose the small amount of data that
existed only in this log (it does not matter if you have replicas or can
re-insert the data). Sent using
Zoho Mail  Forwarded message  From : Will Mackle 
 To :  Date : Sat, 08 Dec 
2018 11:56:00 +0330 Subject : Cassandra does launch since computer was 
accidentally unplugged  Forwarded message  Hello, I am 
a novice cassandra user and am looking for some insight with respect to my 
circumstance:  The computer I was using to run cassandra was accidentally 
unplugged by my friend, since this event, I have not been able to successfully 
relaunch cassandra.   I have included a chunk from the log file below.  It 
looks to me like the corrupt log files are the issue, but I would like to 
confirm that that error is not dependent on the earlier JMX error.  Does this 
JMX error impact cassandra's launch if cassandra is only being accessed by the 
computer that cassandra is running on?  I have the port assinged in 
cassandra-env.sh, so it is really confusing to me why this error occurs. With 
respect to the log file corruption, does there exist the capacity to 
recover/repair the issue?  I'm assuming that if I delete the log file to launch 
cassandra that I will lose data.. am I correct in this assumption? I left some 
lines out of the log file that were not errors or warnings, if it is important 
for me to include them, I can do so, I'm simply not sure if any info from the 
log file is a security risk for me to share. INFO  14:50:01 JVM Arguments: 
[-ea, -javaagent:C:\Program 
Files\DataStax-DDC\apache-cassandra\lib\jamm-0.3.0.jar, -Xms1G, -Xmx1G, 
-XX:+HeapDumpOnOutOfMemoryError, -XX:+UseParNewGC, -XX:+UseConcMarkSweepGC, 
-XX:+CMSParallelRemarkEnabled, -XX:SurvivorRatio=8, -XX:MaxTenuringThreshold=1, 
-XX:CMSInitiatingOccupancyFraction=75, -XX:+UseCMSInitiatingOccupancyOnly, 
-Dcom.sun.management.jmxremote.port=7199, 
-Dcom.sun.management.jmxremote.ssl=false, 
-Dcom.sun.management.jmxremote.authenticate=false, 
-Dlog4j.configuration=log4j-server.properties, 
-Dlog4j.defaultInitOverride=true, -DCassandra] WARN  14:50:01 JNA link failure, 
one or more native method will be unavailable. WARN  14:50:01 JMX is not 
enabled to receive remote connections. Please see cassandra-env.sh for more 
info. ERROR 14:50:01 cassandra.jmx.local.port missing from cassandra-env.sh, 
unable to start local JMX service. WARN  14:50:01 Use of 
com.sun.management.jmxremote.port at startup is deprecated. Please use 
cassandra.jmx.remote.port instead. --- INFO  14:50:05 Not submitting build 
tasks for views in keyspace system as storage service is not initialized WARN  
14:50:05 JMX settings in cassandra-env.sh have been bypassed as the JMX 
connector server is already initialized. Please refer to cassandra-env.(sh|ps1) 
for JMX configuration info INFO  14:50:07 Populating token metadata from system 
tables --- INFO  14:50:08 Completed loading (15 ms; 26 keys) KeyCache cache 
INFO  14:50:08 Replaying C:\Program 
Files\DataStax-DDC\data\commitlog\CommitLog-6-1542650688952.log, C:\Program 
Files\DataStax-DDC\data\commitlog\CommitLog-6-1542650688953.log, C:\Program 
Files\DataStax-DDC\data\commitlog\CommitLog-6-1542987010987.log, C:\Program 
Files\DataStax-DDC\data\commitlog\CommitLog-6-1542987613467.log, C:\Program 
Files\DataStax-DDC\data\commitlog\CommitLog-6-1542990216101.log ERROR 14:50:09 
Exiting due to error while processing commit log during initialization. 
org.apache.cassandra.db.commitlog.CommitLogReadHandler$CommitLogReadException: 
Could not read commit log descriptor in file C:\Program 
Files\DataStax-DDC\data\commitlog\CommitLog-6-1542650688953.log at 
org.apache.cassandra.db.commitlog.CommitLogReader.readCommitLogSegment(CommitLogReader.java:155)
 [apache-cassandra-3.9.0.jar:3.9.0] at 
org.apache.cassandra.db.commitlog.CommitLogReader.readAllFiles(CommitLogReader.java:85)
 [apache-cassandra-3.9.0.jar:3.9.0] at 
org.apache.cassandra.db.commitlog.CommitLogReplayer.replayFiles(CommitLogReplayer.java:135)
 [apache-cassandra-3.9.0.jar:3.9.0] at 
org.apache.cassandra.db.commitlog.CommitLog.recoverFiles(CommitLog.java:187) 
[apache-cassandra-3.9.0.jar:3.9.0] at 
org.apache.cassandra.db.commitlog.CommitLog.recoverSegmentsOnDisk(CommitLog.java:167)
 [apache-cassandra-3.9.0.jar:3.9.0] at 
org.apache.cassandra.service.CassandraDaemon.setup(CassandraDaemon.java:323) 
[apache-cassandra-3.9.0.jar:3.9.0] at 
org.apache.cassandra.service.CassandraDaemon.activate(CassandraDaemon.java:601) 
[apache-cassandra-3.9.0.jar:3.9.0] at 
org.apache.cassandra.service.CassandraDaemon.main(CassandraDaemon.java:730) 
[apache-cassandra-3.9.0.jar:3.9.0] Any help/insight is much appreciated, Thanks

Fwd: Re: How to gracefully decommission a highly loaded node?

2018-12-06 Thread onmstester onmstester
After a few hours, i just removed the node. Then i decommissioned another node,
which finished successfully (the writer app was down, so there was no pressure
on the cluster). I started another node decommission (the third one); since i
didn't have time to wait for the decommissioning to finish, i started the writer
application when almost all of the decommissioning node's streaming was done and
only a few GBs to two other nodes remained to be streamed. After 12 hours i
checked the decommissioning node and netstats says: LEAVING, Restore Replica
Count! So i just ran removenode on this one too. Is there something wrong with
decommissioning while someone is writing to the cluster? Using Apache Cassandra
3.11.2 Sent using Zoho Mail  Forwarded message  From : 
onmstester onmstester  To : 
"user" Date : Wed, 05 Dec 2018 09:00:34 +0330 
Subject : Fwd: Re: How to gracefully decommission a highly loaded node? 
 Forwarded message  After a long time stuck in LEAVING, 
and "not doing any streams", i killed Cassandra process and restart it, then 
again ran nodetool decommission (Datastax recipe for stuck decommission), now 
it says, LEAVING, "unbootstrap $(the node id)" What's going on? Should i forget 
about decommission and just remove the node? There is an issue to make 
decommission resumable: https://issues.apache.org/jira/browse/CASSANDRA-12008 
but i couldn't figure out how this suppose to work? I was expecting that after 
restarting stucked-decommission-cassandra, it resume the decommissioning 
process, but the node became UN after restart. Sent using Zoho Mail 
 Forwarded message  From : Simon Fontana Oscarsson 
 To : 
"user@cassandra.apache.org" Date : Tue, 04 Dec 2018 
15:20:15 +0330 Subject : Re: How to gracefully decommission a highly loaded 
node?  Forwarded message  
Hi, If it already uses 100 % 
CPU I have a hard time seeing it being able to do a decomission while serving 
requests. If you have a lot of free space I would first try nodetool 
disableautocompaction. If you don't see any progress in nodetool netstats you 
can also disablebinary, disablethrift and disablehandoff to stop serving client 
requests.  -- SIMON FONTANA OSCARSSON
Software Developer

Ericsson
Ölandsgatan 1
37133 Karlskrona, Sweden
simon.fontana.oscars...@ericsson.com
www.ericsson.com On tis, 2018-12-04 at 14:21 +0330, onmstester onmstester 
wrote: One node suddenly uses 100% CPU, i suspect hardware problems and do not 
have time to trace that, so decided to just remove the node from the cluster, 
but although the node state changed to UL, but no sign of Leaving: the node is 
still compacting and flushing memtables, writing mutations and CPU is 100% for 
hours since. Is there any means to force a Cassandra Node to just decommission 
and stop doing normal things? Due to W.CL=ONE, i can not use removenode and 
shutdown the node Best Regards Sent using Zoho Mail


Fwd: Re: How to gracefully decommission a highly loaded node?

2018-12-04 Thread onmstester onmstester
After a long time stuck in LEAVING, and "not doing any streams", i killed 
Cassandra process and restart it, then again ran nodetool decommission 
(Datastax recipe for stuck decommission), now it says, LEAVING, "unbootstrap 
$(the node id)" What's going on? Should i forget about decommission and just 
remove the node? There is an issue to make decommission resumable: 
https://issues.apache.org/jira/browse/CASSANDRA-12008 but i couldn't figure out 
how this suppose to work? I was expecting that after restarting 
stucked-decommission-cassandra, it resume the decommissioning process, but the 
node became UN after restart. Sent using Zoho Mail  Forwarded 
message  From : Simon Fontana Oscarsson 
 To : 
"user@cassandra.apache.org" Date : Tue, 04 Dec 2018 
15:20:15 +0330 Subject : Re: How to gracefully decommission a highly loaded 
node?  Forwarded message  Hi, If it already uses 100 % 
CPU I have a hard time seeing it being able to do a decomission while serving 
requests. If you have a lot of free space I would first try nodetool 
disableautocompaction. If you don't see any progress in nodetool netstats you 
can also disablebinary, disablethrift and disablehandoff to stop serving client 
requests.  -- SIMON FONTANA OSCARSSON
Software Developer

Ericsson
Ölandsgatan 1
37133 Karlskrona, Sweden
simon.fontana.oscars...@ericsson.com
www.ericsson.com On tis, 2018-12-04 at 14:21 +0330, onmstester onmstester 
wrote: One node suddenly uses 100% CPU, i suspect hardware problems and do not 
have time to trace that, so decided to just remove the node from the cluster, 
but although the node state changed to UL, but no sign of Leaving: the node is 
still compacting and flushing memtables, writing mutations and CPU is 100% for 
hours since. Is there any means to force a Cassandra Node to just decommission 
and stop doing normal things? Due to W.CL=ONE, i can not use removenode and 
shutdown the node Best Regards Sent using Zoho Mail


How to gracefully decommission a highly loaded node?

2018-12-04 Thread onmstester onmstester
One node suddenly uses 100% CPU, i suspect hardware problems and do not have 
time to trace that, so decided to just remove the node from the cluster, but 
although the node state changed to UL, but no sign of Leaving: the node is 
still compacting and flushing memtables, writing mutations and CPU is 100% for 
hours since. Is there any means to force a Cassandra Node to just decommission 
and stop doing normal things? Due to W.CL=ONE, i can not use removenode and 
shutdown the node Best Regards Sent using Zoho Mail
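
A rough sketch of the sequence suggested in the replies above (quiet the node
first, then decommission and watch netstats while it runs). It assumes the
script runs on the overloaded node itself with nodetool on the PATH; the polling
interval is an arbitrary choice:

import subprocess
import time

# stop background and client-facing work first, as suggested above
for cmd in ('disableautocompaction', 'disablebinary',
            'disablethrift', 'disablehandoff'):
    subprocess.check_call(['nodetool', cmd])

# decommission streams this node's data to the rest of the cluster
decom = subprocess.Popen(['nodetool', 'decommission'])

# print streaming progress while the decommission is still running
while decom.poll() is None:
    subprocess.call(['nodetool', 'netstats'])
    time.sleep(300)
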

Fwd: RE : issue while connecting to apache-cassandra-3.11.1 hosted on a remote VM.

2018-11-16 Thread onmstester onmstester
Also set rpc_address to your remote ip address and restart cassandra. Run 
nodetool status on Cassandra node to be sure that its running properly. The 
port you should look for and connect to is 9042, 7199 is the JMX port Sent 
using Zoho Mail  Forwarded message  From : Gaurav Kumar 
 To : 
"d...@cassandra.apache.org" Date : Fri, 16 Nov 2018 
13:13:56 +0330 Subject : RE : issue while connecting to apache-cassandra-3.11.1 
hosted on a remote VM.  Forwarded message  Hi, Whenever 
I am trying to connect to apache-cassandra-3.11.1, I am getting exception 
Unexpected client failure - null The detailed explanation : 
JMXConnectionPool.getJMXConnection - Failed to retrieve RMIServer stub: 
javax.naming.ServiceUnavailableException [Root exception is 
java.rmi.ConnectException: Connection refused to host:; nested 
exception is: java.net.ConnectException: Connection refused (Connection 
refused)] I tried following workaround 1) Adding IP Address of the machine 
(where server is installed) in /etc/hosts (For Linux OS) 2) Adding IP Address 
of the machine (where server is installed) in cassandra.yaml (entry as seed and 
listen addresses) 3)Also check for proper variables are set for java and 
Cassandra. However, while executing the "netstat -an | grep 7199" command ,I am 
still getting the 127.0.0.1 as the hosted ip. Can you suggest me any change 
which needs to be done in the configuration or the connection mechanism of 
apache-cassandra-3.11.1 ,because my application is working fine for 
apache-cassandra-3.0.15. ? Kindly, revert ASAP. Thanks and Regards, Gaurav 
Kumar- Software Engineer

Fwd: Re: Multiple cluster for a single application

2018-11-08 Thread onmstester onmstester
Thank you all, Actually, "the documents" i mentioned in my question, was a talk 
in youtube seen long time ago and could not find it. Also noticing that a lot 
of companies like Netflix built hundreds of Clusters each having 10s of nodes 
and saying that its much stable, i just concluded that big cluster is not 
recommended. I see some of the reasons in your answers: the problem with 
dynamic snitch and probability of node failures that simply increases with more 
nodes in cluster which even could cause cluster outage.

Multiple cluster for a single application

2018-11-05 Thread onmstester onmstester
Hi, One of my applications requires to create a cluster with more than 100 
nodes, I've read documents recommended to use clusters with less than 50 or 100 
nodes (Netflix got hundreds of clusters with less 100 nodes on each). Is it a 
good idea to use multiple clusters for a single application, just to decrease 
maintenance problems and system complexity/performance? If So, which one of 
below policies is more suitable to distribute data among clusters and Why? 1. 
each cluster' would be responsible for a specific partial set of tables only 
(table sizes are almost equal so easy calculations here) for example inserts to 
table X would go to cluster Y 2. shard data at loader level by some business 
logic grouping of data, for example all rows with some column starting with X 
would go to cluster Y I would appreciate sharing your experiences working with 
big clusters, problem encountered and solutions. Thanks in Advance Sent using 
Zoho Mail

Fwd: Re: A quick question on unlogged batch

2018-11-02 Thread onmstester onmstester
> unlogged batch meaningfully outperforms parallel execution of individual
> statements, especially at scale, and creates lower memory pressure on both
> the clients and cluster.

They do outperform parallel individual statements, but at the cost of higher
pressure on the coordinators, which leads to more blocked native transport
requests and dropped mutations. Actually i think that 10-20% better write
performance plus 20-30% less CPU usage on the client machines (we don't care
about client machines compared with cluster machines), which is the outcome of
batch statements with multiple partitions per batch, is not worth it, because
less-busy cluster nodes are needed to answer read queries, compactions,
repairs, etc.

> The biggest major downside to unlogged batches is that the unit of retry
> during failure is the entire batch. So if you use a retry policy, write
> timeouts will tip over your cluster a lot faster than individual statements.
> Bounding your batch sizes helps mitigate this risk.

I assume that in most scenarios the client machines are on the same network as
the Cassandra cluster, so is it still faster?

> Thank you all. Now I understand whether to use batch or asynchronous writes
> really depends on the use case. Till now batch writes work for me in an
> 8-node cluster with over 500 million requests per day.

Did you compare the cluster performance (including blocked native transport
requests, dropped mutations, 95th percentiles, cluster CPU usage, etc.) in the
two scenarios (batch vs single)? Although 500M per day is not that much for an
8-node cluster (if the node spec complies with the DataStax recommendations) and
async single statements could handle it (it just demands more CPU on the client
machines), the impact of such things (non-compliant batch statements annoying
the cluster) would show up after some weeks, when suddenly a lot of cluster
tasks need to run simultaneously: one or two big compactions running on most of
the nodes, some hinted handoffs, and the cluster can no longer keep up and
starts to become slower and slower. The way to prevent it sooner would be to
keep the error counters as low as possible: things like blocked native transport
requests, dropped mutations, errors, hinted hand-offs, latencies, etc.
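
A rough sketch of keeping an eye on those counters with nodetool tpstats. It
assumes nodetool is on the PATH of the node being watched; the filter strings
simply pick out the Native-Transport-Requests pool line and the MUTATION/READ
lines of the dropped-message section, which is an assumption about the output
format of this version:

import subprocess
import time

while True:
    out = subprocess.check_output(['nodetool', 'tpstats'],
                                  universal_newlines=True)
    for line in out.splitlines():
        # Native-Transport-Requests shows blocked client requests;
        # the MUTATION/READ lines near the end show dropped messages
        if ('Native-Transport-Requests' in line
                or 'MUTATION' in line or 'READ' in line):
            print(line)
    print('-' * 40)
    time.sleep(60)
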

Fwd: Re: Re: How to set num tokens on live node

2018-11-02 Thread onmstester onmstester
I think that is not possible. If currently both DC's are in use, you should 
remove one of them (gently, by changing replication config), then change 
num_tokens in removed dc, add it again with changing replication config, and 
finally do the same for the other dc. P.S A while ago, there was a thread in 
this forum, discussing that num_tokens 256 is not a good default in Cassandra 
and should use a smaller number like 4,8 or 16, i recommend you to read it 
through, maybe the whole migration (from 8 to 256) became unnecessary Sent 
using Zoho Mail  Forwarded message  From : Goutham 
reddy  To :  Date : Fri, 
02 Nov 2018 11:52:53 +0330 Subject : Re: Re: How to set num tokens on live node 
 Forwarded message  Onmstester, Thanks for the reply, 
but for both the DC’s I need to change my num_token value from 8 to 256. So 
that is the challenge I am facing. Any comments. Thanks and Regards, Goutham On 
Fri, Nov 2, 2018 at 1:08 AM onmstester onmstester  
wrote: -- Regards Goutham Reddy IMHO, the best option with two datacenters is 
to config replication strategy to stream data from dc with wrong num_token to 
correct one, and then a repair on each node would move your data to the other 
dc Sent using Zoho Mail  Forwarded message  From : 
Goutham reddy  To :  
Date : Fri, 02 Nov 2018 10:46:10 +0330 Subject : Re: How to set num tokens on 
live node  Forwarded message  Elliott,  Thanks Elliott, 
how about if we have two Datacenters, any comments? Thanks and Regards, 
Goutham. On Thu, Nov 1, 2018 at 5:40 PM Elliott Sims  
wrote: -- Regards Goutham Reddy As far as I know, it's not possible to change 
it live.  You have to create a new "datacenter" with new hosts using the new 
num_tokens value, then switch everything to use the new DC and tear down the 
old. On Thu, Nov 1, 2018 at 6:16 PM Goutham reddy  
wrote: Hi team, Can someone help me out I don’t find anywhere how to change the 
numtokens on a running nodes. Any help is appreciated Thanks and Regards, 
Goutham. -- Regards Goutham Reddy

Fwd: Re: How to set num tokens on live node

2018-11-02 Thread onmstester onmstester
IMHO, the best option with two datacenters is to config replication strategy to 
stream data from dc with wrong num_token to correct one, and then a repair on 
each node would move your data to the other dc Sent using Zoho Mail 
 Forwarded message  From : Goutham reddy 
 To :  Date : Fri, 02 
Nov 2018 10:46:10 +0330 Subject : Re: How to set num tokens on live node 
 Forwarded message  Elliott,  Thanks Elliott, how about 
if we have two Datacenters, any comments? Thanks and Regards, Goutham. On Thu, 
Nov 1, 2018 at 5:40 PM Elliott Sims  wrote: -- Regards 
Goutham Reddy As far as I know, it's not possible to change it live.  You have 
to create a new "datacenter" with new hosts using the new num_tokens value, 
then switch everything to use the new DC and tear down the old. On Thu, Nov 1, 
2018 at 6:16 PM Goutham reddy  wrote: Hi team, Can 
someone help me out I don’t find anywhere how to change the numtokens on a 
running nodes. Any help is appreciated Thanks and Regards, Goutham. -- Regards 
Goutham Reddy

Fwd: A quick question on unlogged batch

2018-11-01 Thread onmstester onmstester
Read this: https://docs.datastax.com/en/cql/3.3/cql/cql_reference/batch_r.html 
Please use batches (of any type) only for statements that concern a single
partition; otherwise they cause a lot of performance degradation on your
cluster, and after a while the throughput would be a lot less than parallel
single statements with executeAsync. Sent using Zoho Mail  Forwarded message 
 From : wxn...@zjqunshuo.com To : "user" 
Date : Thu, 01 Nov 2018 10:48:33 +0330 Subject : A quick question on unlogged 
batch  Forwarded message  Hi All, What's the difference 
between logged batch and unlogged batch? I'm asking this question it's because 
I'm seeing the below WARNINGs after a new app started writting to the cluster.  
WARNING in system.log: Unlogged batch covering 135 partitions detected against 
table [cargts.eventdata]. You should use a logged batch for atomicity, or 
asynchronous writes for performance Best regards, -Simon
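
A rough sketch of that advice with the DataStax Python driver: group rows by
partition key and only put rows of the same partition into one unlogged batch.
The keyspace and table come from the warning above (cargts.eventdata), but the
column names, contact point and row layout are illustrative assumptions:

from collections import defaultdict
from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(['127.0.0.1'])                 # assumed contact point
session = cluster.connect('cargts')              # keyspace from the warning
insert = session.prepare(
    "INSERT INTO eventdata (device_id, ts, payload) VALUES (?, ?, ?)")  # assumed columns

def write(rows):
    # rows: iterable of (device_id, ts, payload); device_id is the partition key
    by_partition = defaultdict(list)
    for row in rows:
        by_partition[row[0]].append(row)

    futures = []
    for _, group in by_partition.items():
        batch = BatchStatement(batch_type=BatchType.UNLOGGED)
        for row in group:
            batch.add(insert, row)
        futures.append(session.execute_async(batch))
    for f in futures:
        f.result()        # wait for completion, surfacing any write error

Each batch then touches a single partition, so the coordinator never has to fan
a batch out to many replica sets.
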

Fwd: Re: Re: High CPU usage on some of the nodes due to message coalesce

2018-10-21 Thread onmstester onmstester
> Any cron or other scheduler running on those nodes?
no
> Lots of Java processes running simultaneously?
no, just Apache Cassandra
> Heavy repair continuously running?
none
> Lots of pending compactions?
none, the cpu goes to 100% in the first seconds of insert (write load) so no
memtable has been flushed yet
> Is the number of CPU cores the same in all the nodes?
yes, 12
> Did you try rebooting one of the nodes?
Yes, cold rebooted all of them once, no luck!
Thanks for your time

Re: Re: High CPU usage on some of the nodes due to message coalesce

2018-10-21 Thread onmstester onmstester
> What takes the most CPU? System or User?
Most of it is used by org.apache.cassandra.util.coalesceInternal and
SepWorker.run.
> Did you try removing a problematic node and installing a brand new one
> (instead of re-adding)?
I did not install a new node, but i did remove the problematic node, and the
CPU load in the whole cluster became normal again.
> When you decommissioned these nodes, did the high CPU "move" to other nodes
> (probably data model/query issues) or was it completely gone? (server issues)
It was completely gone.

Fwd: Re: High CPU usage on some of the nodes due to message coalesce

2018-10-21 Thread onmstester onmstester
I don't think that root cause is related to Cassandra config, because the nodes 
are homogeneous and config for all of them are the same (16GB heap with default 
gc), also mutation counter and Native Transport counter is the same in all of 
the nodes, but only these 3 nodes experiencing 100% CPU usage (others have less 
than 20% CPU usage)  I even decommissioned these 3 nodes from cluster and 
re-add them, but still the same The cluster is OK without these 3 nodes (in a 
state that these nodes are decommissioned) Sent using Zoho Mail  
Forwarded message  From : Chris Lohfink  To : 
 Date : Sat, 20 Oct 2018 23:24:03 +0330 Subject : 
Re: High CPU usage on some of the nodes due to message coalesce  
Forwarded message  1s young gcs are horrible and likely cause of 
some of your bad metrics. How large are your mutations/query results and what 
gc/heap settings are you using? You can use 
https://github.com/aragozin/jvm-tools to see the threads generating allocation 
pressure and using the cpu (ttop) and what garbage is being created (hh 
--dead-young). Just a shot in the dark, I would guess you have rather large 
mutations putting pressure on commitlog and heap. G1 with a larger heap might 
help in that scenario to reduce fragmentation and adjust its eden and survivor 
regions to the allocation rate better (but give it a bigger reserve space) but 
theres limits to what can help if you cant change your workload. Without more 
info on schema etc its hard to tell but maybe that can help give you some ideas 
on places to look. It could just as likely be repair coordination, wide 
partition reads, or compactions so need to look more at what within the app is 
causing the pressure to know if its possible to improve with settings or if the 
load your application is producing exceeds what your cluster can handle (needs 
more nodes). Chris On Oct 20, 2018, at 5:18 AM, onmstester onmstester 
 wrote: 3 nodes in my cluster have 100% cpu usage 
and most of it is used by org.apache.cassandra.util.coalesceInternal and 
SepWorker.run? The most active threads are the messaging-service-incomming. 
Other nodes are normal, having 30 nodes, using Rack Aware strategy. with 10 
rack each having 3 nodes. The problematic nodes are configured for one rack, on 
normal write load, system.log reports too many hint message dropped (cross 
node). also there are alot of parNewGc with about 700-1000ms and commit log 
isolated disk, is utilized about 80-90%. on startup of these 3 nodes, there are 
alot of "updateing topology" logs (1000s of them pending). Using iperf, i'm 
sure that network is OK checking NTPs and mutations on each node, load is 
balanced among the nodes. using apache cassandra 3.11.2 I can not not figure 
out the root cause of the problem, although there are some obvious symptoms. 
Best Regards Sent using Zoho Mail

How to validate if network infrastructure is efficient for Cassandra cluster?

2018-10-21 Thread onmstester onmstester
Currently, before launching the production cluster, i run 'iperf -s' on half of
the cluster and 'iperf -c $nextIP' on the other half using parallel ssh, so all
of the cluster's nodes are connected (in pairs) simultaneously. Then i examine
the iperf results and do the math to see whether the switches can keep up with
the Cassandra load or not. I'm afraid that i don't know the usual packet size of
Cassandra, and in real scenarios each node streams to many other nodes. Any
better idea on how to examine the network before running a cluster? Sent
using Zoho Mail
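
A rough sketch of that pairing (the first half of the hosts run iperf -s, the
second half run iperf -c against them), assuming passwordless ssh and iperf
installed on every node; the host list and test duration are placeholders:

import subprocess

HOSTS = ['10.0.0.1', '10.0.0.2', '10.0.0.3', '10.0.0.4']   # placeholder hosts
half = len(HOSTS) // 2
servers, clients = HOSTS[:half], HOSTS[half:]

# start an iperf server on each node of the first half
for s in servers:
    subprocess.Popen(['ssh', s, 'iperf', '-s'])

# pair every client with one server and run the tests in parallel
runs = [subprocess.Popen(['ssh', c, 'iperf', '-c', s, '-t', '60'],
                         stdout=subprocess.PIPE, universal_newlines=True)
        for c, s in zip(clients, servers)]

for run in runs:
    out, _ = run.communicate()
    print(out)

# afterwards, stop the servers, e.g. ssh <host> pkill iperf
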

High CPU usage on some of the nodes due to message coalesce

2018-10-20 Thread onmstester onmstester
3 nodes in my cluster have 100% cpu usage and most of it is used by 
org.apache.cassandra.util.coalesceInternal and SepWorker.run? The most active 
threads are the messaging-service-incomming. Other nodes are normal, having 30 
nodes, using Rack Aware strategy. with 10 rack each having 3 nodes. The 
problematic nodes are configured for one rack, on normal write load, system.log 
reports too many hint message dropped (cross node). also there are alot of 
parNewGc with about 700-1000ms and commit log isolated disk, is utilized about 
80-90%. On startup of these 3 nodes, there are a lot of "updating topology"
logs (1000s of them pending). Using iperf, i'm sure that the network is OK.
Checking NTPs and mutations on each node, the load is balanced among the nodes.
Using apache cassandra 3.11.2. I cannot figure out the root cause of the
problem, although there are some obvious symptoms. Best Regards Sent using Zoho
Mail

Re: Re: Re: how to configure the Token Allocation Algorithm

2018-10-02 Thread onmstester onmstester
Sent using Zoho Mail  On Mon, 01 Oct 2018 18:36:03 +0330 Alain RODRIGUEZ
 wrote 

Hello again :), I thought a little bit more about this question, and I was
actually wondering if something like this would work:

Imagine a 3 node cluster, and create the nodes using:

For the 3 nodes: `num_tokens: 4`
Node 1: `initial_token: -9223372036854775808, -4611686018427387905, -2,
4611686018427387901`
Node 2: `initial_token: -7686143364045646507, -3074457345618258604,
1537228672809129299, 6148914691236517202`
Node 3: `initial_token: -6148914691236517206, -1537228672809129303,
3074457345618258600, 7686143364045646503`

If you know the initial size of your cluster, you can calculate the total number
of tokens: number of nodes * vnodes, and use the formula/python code above to
get the tokens. Then use the first token for the first node, move to the second
node, use the second token, and repeat. In my case there is a total of 12 tokens
(3 nodes, 4 tokens each).

```
>>> number_of_tokens = 12
>>> [str(((2**64 / number_of_tokens) * i) - 2**63) for i in range(number_of_tokens)]
['-9223372036854775808', '-7686143364045646507', '-6148914691236517206',
'-4611686018427387905', '-3074457345618258604', '-1537228672809129303', '-2',
'1537228672809129299', '3074457345618258600', '4611686018427387901',
'6148914691236517202', '7686143364045646503']
```

Using manual initial_token assignment (your idea), how could i add a new node to
a long-running cluster (what would the procedure be)?
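
A small sketch of the distribution Alain describes above: compute nodes * vnodes
evenly spaced tokens, then hand them out round-robin, so node 1 gets tokens
1, 4, 7, ..., node 2 gets tokens 2, 5, 8, ..., and so on. With the 3-node /
4-vnode example from the quoted mail it reproduces the three lists above:

nodes = 3        # example values from the quoted mail
vnodes = 4

total = nodes * vnodes
tokens = [((2**64 // total) * i) - 2**63 for i in range(total)]

# round-robin: token i goes to node (i % nodes)
for n in range(nodes):
    assigned = [str(t) for t in tokens[n::nodes]]
    print('node %d initial_token: %s' % (n + 1, ', '.join(assigned)))
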

Fwd: Re: Re: how to configure the Token Allocation Algorithm

2018-10-01 Thread onmstester onmstester
Thanks Alex, You are right, that would be a mistake. Sent using Zoho Mail 
 Forwarded message  From : Oleksandr Shulgin 
 To : "User" Date : 
Mon, 01 Oct 2018 13:53:37 +0330 Subject : Re: Re: how to configure the Token 
Allocation Algorithm  Forwarded message  On Mon, Oct 1, 
2018 at 12:18 PM onmstester onmstester  wrote: What if 
instead of running that python and having one node with non-vnode config, i 
remove the first seed node and re-add it after cluster was fully up ? so the 
token ranges of first seed node would also be assigned by Allocation Alg I 
think this is tricky because the random allocation of the very first tokens 
from the first seed affects the choice of tokens made by the algorithm on the 
rest of the nodes: it basically tries to divide the token ranges in more or 
less equal parts.  If your very first 8 tokens resulted in really bad balance, 
you are not going to remove that imbalance by removing the node, it would still 
have the lasting effect on the rest of your cluster. -- Alex

Fwd: Re: how to configure the Token Allocation Algorithm

2018-10-01 Thread onmstester onmstester
Thanks Alain, What if instead of running that python and having one node with 
non-vnode config, i remove the first seed node and re-add it after cluster was 
fully up ? so the token ranges of first seed node would also be assigned by 
Allocation Alg  Forwarded message  From : Alain 
RODRIGUEZ  To : "user 
cassandra.apache.org" Date : Mon, 01 Oct 2018 
13:14:21 +0330 Subject : Re: how to configure the Token Allocation Algorithm 
 Forwarded message  Hello, Your process looks good to 
me :). Still a couple of comments to make it more efficient (hopefully). - 
Improving step 2: I believe you can actually get a slightly better distribution 
picking the tokens for the (first) seed node. This is to prevent the node from 
randomly calculating its token ranges. You can calculate the token ranges using 
the following python code:  $ python # Start the python shell [...] >>> 
number_of_tokens = 8 >>> [str(((2**64 / number_of_tokens) * i) - 2**63) for i 
in range(number_of_tokens)] ['-9223372036854775808', '-6917529027641081856', 
'-4611686018427387904', '-2305843009213693952', '0', '2305843009213693952', 
'4611686018427387904', '6917529027641081856'] Set the 'initial_token' with the 
above list (coma separated list) and the number of vnodes to 'num_tokens: 8'. 
This technique proved to be way more efficient (especially for low token 
numbers / small number of nodes). Luckily it's also easy to test.

how to configure the Token Allocation Algorithm

2018-09-30 Thread onmstester onmstester
Since i failed to find a document on how to configure and use the Token 
Allocation Algorithm (to replace the random Algorithm), just wanted to be sure 
about the procedure i've done: 1. Using Apache Cassandra 3.11.2 2. Configured 
one of seed nodes with num_tokens=8 and started it. 3. Using Cqlsh created 
keyspace test with NetworkTopologyStrategy and RF=3. 4. Stopped the seed node. 
5. add this line to cassandra.yaml of all nodes (all have num_tokens=8) and 
started the cluster: allocate_tokens_for_keyspace=test My cluster Size won't go 
beyond 150 nodes, should i still use The Allocation Algorithm instead of random 
with 256 tokens (performance wise or load-balance wise)? Is the Allocation 
Algorithm, widely used and tested with Community and can we migrate all 
clusters with any size to use this Algorithm Safely? Out of Curiosity, i wonder 
how people (i.e, in Apple) config and maintain token management of clusters 
with thousands of nodes? Sent using Zoho Mail

High CPU usage on writer application

2018-09-24 Thread onmstester onmstester
Hi, My app writes 100K rows per second to a C* cluster (30 nodes, version
3.11.2). There are 20 threads, each writing 10K statements (the list size in
the code below is 100K) using the async API:

for (Statement s : list) {
    ResultSetFuture future = session.executeAsync(s);
    tasks.add(future);
    if (tasks.size() < 1)
        continue;
    for (ResultSetFuture t : tasks)
        t.getUninterruptibly(1, TimeUnit.MILLISECONDS);
    tasks.clear();
}
if (tasks.size() != 0) {
    for (ResultSetFuture t : tasks)
        t.getUninterruptibly(1, TimeUnit.MILLISECONDS);
}

CPU usage of my loader application is > 80% on a 20-core Xeon. Sampling with
jvisualvm shows these at the top by percentage of all CPU time:

io.netty.channel.epoll.Native.epollWait0 40%
shade.com.datastax.spark.connecto.google.common.util.concurrent.AbstractFuture$Sync.get() 10%
com.datastax.driver.core.RequestHandler.init 10%

It seems like it checks for completion of all tasks every few nanoseconds. Is
there any workaround to decrease the CPU usage of my application, which is
currently the bottleneck?
Sent using Zoho Mail
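
For what it's worth, a rough sketch of a "bounded number of in-flight requests"
approach, written with the DataStax Python driver for brevity (the Java driver
equivalent would typically use a Semaphore plus future callbacks). The contact
point, keyspace and in-flight limit are assumptions; the point is simply to
block on a semaphore instead of polling futures with very short timeouts, so the
client CPU goes to the writes themselves:

import threading
from cassandra.cluster import Cluster

cluster = Cluster(['127.0.0.1'])        # assumed contact point
session = cluster.connect('ks')         # assumed keyspace

MAX_IN_FLIGHT = 1024                    # tune to what the cluster can absorb
permits = threading.BoundedSemaphore(MAX_IN_FLIGHT)

def write_all(statements):
    errors = []

    def release(_):
        permits.release()

    def failed(exc):
        errors.append(exc)
        permits.release()

    for stmt in statements:
        permits.acquire()               # block instead of spinning on timeouts
        future = session.execute_async(stmt)
        future.add_callbacks(callback=release, errback=failed)

    # drain: once every permit can be taken back, all requests have completed
    for _ in range(MAX_IN_FLIGHT):
        permits.acquire()
    return errors
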

Re: node replacement failed

2018-09-22 Thread onmstester onmstester
Another question: is there a management tool to run nodetool cleanup one node at
a time (wait until cleanup finishes on one node, then start cleanup on the next
node in the cluster)?

 On Sat, 22 Sep 2018 16:02:17 +0330 onmstester onmstester
 wrote 

I have a cunning plan (Baldrick wise) to solve this problem:
1. stop the client application
2. run nodetool flush on all nodes to save the memtables to disk
3. stop cassandra on all of the nodes
4. rename the original Cassandra data directory to data-old
5. start cassandra on all the nodes to create a fresh cluster, including the
old dead nodes again
6. create the application-related keyspaces in cqlsh, and this time set rf=2 on
the system keyspaces (to never encounter this problem again!)
7. move the sstables from the data-backup dir to the current data dirs and
restart cassandra or reload the sstables
Should this work and solve my problem?

 On Mon, 10 Sep 2018 17:12:48 +0430 onmstester onmstester 
 wrote  Thanks Alain, First here it is more detail 
about my cluster: 10 racks + 3 nodes on each rack nodetool status: shows 27 
nodes UN and 3 nodes all related to single rack as DN version 3.11.2 Option 1: 
(Change schema and) use replace method (preferred method) * Did you try to have 
the replace going, without any former repairs, ignoring the fact 
'system_traces' might be inconsistent? You probably don't care about this 
table, so if Cassandra allows it with some of the nodes down, going this way is 
relatively safe probably. I really do not see what you could lose that matters 
in this table. * Another option, if the schema first change was accepted, is to 
make the second one, to drop this table. You can always rebuild it in case you 
need it I assume. I really love to let the replace going, but it stops with the 
error: java.lang.IllegalStateException: unable to find sufficient sources for 
streaming range in keyspace system_traces Also i could delete system_traces 
which is empty anyway, but there is a system_auth and system_distributed 
keyspace too and they are not empty, Could i delete them safely too? If i could 
just somehow skip streaming the system keyspaces from node replace phase, the 
option 1 would be great. P.S: Its clear to me that i should use at least RF=3 
in production, but could not manage to acquire enough resources yet (i hope 
would be fixed in recent future) Again Thank you for your time Sent using Zoho 
Mail  On Mon, 10 Sep 2018 16:20:10 +0430 Alain RODRIGUEZ 
 wrote  Hello, I am sorry it took us (the community) 
more than a day to answer to this rather critical situation. That being said, 
my recommendation at this point would be for you to make sure about the impacts 
of whatever you would try. Working on a broken cluster, as an emergency might 
lead you to a second mistake, possibly more destructive than the first one. It 
happened to me and around, for many clusters. Move forward even more carefuly 
in these situations as a global advice. Suddenly i lost all disks of 
cassandar-data on one of my racks With RF=2, I guess operations use LOCAL_ONE 
consistency, thus you should have all the data in the safe rack(s) with your 
configuration, you probably did not lose anything yet and have the service only 
using the nodes up, that got the right data.  tried to replace the nodes with 
same ip using this: 
https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
 As a side note, I would recommend you to use 'replace_address_first_boot' 
instead of 'replace_address'. This does basically the same but will be ignored 
after the first bootstrap. A detail, but hey, it's there and somewhat safer, I 
would use this one. java.lang.IllegalStateException: unable to find sufficient 
sources for streaming range in keyspace system_traces By default, non-user 
keyspace use 'SimpleStrategy' and a small RF. Ideally, this should be changed 
in a production cluster, and you're having an example of why. Now when i 
altered the system_traces keyspace startegy to NetworkTopologyStrategy and RF=2 
but then running nodetool repair failed: Endpoint not alive /IP of dead node 
that i'm trying to replace. Changing the replication strategy you made the dead 
rack owner of part of the token ranges, thus repairs just can't work as there 
will always be one of the nodes involved down as the whole rack is down. Repair 
won't work, but you probably do not need it! 'system_traces' is a temporary / 
debug table. It's probably empty or with irrelevant data. Here are some 
thoughts: * It would be awesome at this point for us (and for you if you did 
not) to see the status of the cluster: ** 'nodetool status' ** 'nodetool 
describecluster' --> This one will tell if the nodes agree on the schema (nodes 
up). I have seen schema changes with nodes down inducing some issues. ** 
Cassandra version ** Number of racks (I assumer #racks >= 2 in this email) 
Option 1: (Change schema and) use replace method (preferred method) * Did you 
try to have the replace going, without any 

Re: node replacement failed

2018-09-22 Thread onmstester onmstester
I have a cunning plan (Baldrick wise) to solve this problem:
1. stop the client application
2. run nodetool flush on all nodes to save the memtables to disk
3. stop cassandra on all of the nodes
4. rename the original Cassandra data directory to data-old
5. start cassandra on all the nodes to create a fresh cluster, including the
old dead nodes again
6. create the application-related keyspaces in cqlsh, and this time set rf=2 on
the system keyspaces (to never encounter this problem again!)
7. move the sstables from the data-backup dir to the current data dirs and
restart cassandra or reload the sstables
Should this work and solve my problem?  On Mon, 10 Sep 
2018 17:12:48 +0430 onmstester onmstester  wrote  
Thanks Alain, First here it is more detail about my cluster: 10 racks + 3 nodes 
on each rack nodetool status: shows 27 nodes UN and 3 nodes all related to 
single rack as DN version 3.11.2 Option 1: (Change schema and) use replace 
method (preferred method) * Did you try to have the replace going, without any 
former repairs, ignoring the fact 'system_traces' might be inconsistent? You 
probably don't care about this table, so if Cassandra allows it with some of 
the nodes down, going this way is relatively safe probably. I really do not see 
what you could lose that matters in this table. * Another option, if the schema 
first change was accepted, is to make the second one, to drop this table. You 
can always rebuild it in case you need it I assume. I really love to let the 
replace going, but it stops with the error: java.lang.IllegalStateException: 
unable to find sufficient sources for streaming range in keyspace system_traces 
Also i could delete system_traces which is empty anyway, but there is a 
system_auth and system_distributed keyspace too and they are not empty, Could i 
delete them safely too? If i could just somehow skip streaming the system 
keyspaces from node replace phase, the option 1 would be great. P.S: Its clear 
to me that i should use at least RF=3 in production, but could not manage to 
acquire enough resources yet (i hope would be fixed in recent future) Again 
Thank you for your time Sent using Zoho Mail  On Mon, 10 Sep 2018 16:20:10 
+0430 Alain RODRIGUEZ  wrote  Hello, I am sorry it took 
us (the community) more than a day to answer to this rather critical situation. 
That being said, my recommendation at this point would be for you to make sure 
about the impacts of whatever you would try. Working on a broken cluster, as an 
emergency might lead you to a second mistake, possibly more destructive than 
the first one. It happened to me and around, for many clusters. Move forward 
even more carefuly in these situations as a global advice. Suddenly i lost all 
disks of cassandar-data on one of my racks With RF=2, I guess operations use 
LOCAL_ONE consistency, thus you should have all the data in the safe rack(s) 
with your configuration, you probably did not lose anything yet and have the 
service only using the nodes up, that got the right data.  tried to replace the 
nodes with same ip using this: 
https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
 As a side note, I would recommend you to use 'replace_address_first_boot' 
instead of 'replace_address'. This does basically the same but will be ignored 
after the first bootstrap. A detail, but hey, it's there and somewhat safer, I 
would use this one. java.lang.IllegalStateException: unable to find sufficient 
sources for streaming range in keyspace system_traces By default, non-user 
keyspace use 'SimpleStrategy' and a small RF. Ideally, this should be changed 
in a production cluster, and you're having an example of why. Now when i 
altered the system_traces keyspace startegy to NetworkTopologyStrategy and RF=2 
but then running nodetool repair failed: Endpoint not alive /IP of dead node 
that i'm trying to replace. Changing the replication strategy you made the dead 
rack owner of part of the token ranges, thus repairs just can't work as there 
will always be one of the nodes involved down as the whole rack is down. Repair 
won't work, but you probably do not need it! 'system_traces' is a temporary / 
debug table. It's probably empty or with irrelevant data. Here are some 
thoughts: * It would be awesome at this point for us (and for you if you did 
not) to see the status of the cluster: ** 'nodetool status' ** 'nodetool 
describecluster' --> This one will tell if the nodes agree on the schema (nodes 
up). I have seen schema changes with nodes down inducing some issues. ** 
Cassandra version ** Number of racks (I assumer #racks >= 2 in this email) 
Option 1: (Change schema and) use replace method (preferred method) * Did you 
try to have the replace going, without any former repairs, ignoring the fact 
'system_traces' might be inconsistent? You probably don't care about this 
table, so if Cassandra allows it with some of the nodes down, going this way is 
relatively safe probably. I really do not see what you coul

Re: stuck with num_tokens 256

2018-09-22 Thread onmstester onmstester
> If you have problems with balance you can add new nodes using the algorithm
> and it'll balance out the cluster. You probably want to stick to 256 tokens
> though.

I read somewhere (i don't remember the reference) that all nodes of the cluster
should use the same algorithm, so if my cluster suffers from imbalanced nodes
using the random algorithm, i cannot add new nodes that use the allocation
algorithm.
isn't that correct?

Re: stuck with num_tokens 256

2018-09-22 Thread onmstester onmstester
Thanks, Because all my clusters are already balanced, i won't change their 
config But one more question, should i use num_tokens : 8 (i would follow 
datastax recommendation) and allocate_tokens_for_local_replication_factor=3 
(which is max RF among my keyspaces) for new clusters which i'm going to setup? 
Is the Allocation algorithm, now recommended algorithm and mature enough to 
replace the Random algorithm? if its so, it should be the default one at 4.0? 
 On Sat, 22 Sep 2018 13:41:47 +0330 kurt greaves  
wrote  If you have problems with balance you can add new nodes using the 
algorithm and it'll balance out the cluster. You probably want to stick to 256 
tokens though. To reduce your # tokens you'll have to do a DC migration (best 
way). Spin up a new DC using the algorithm on the nodes and set a lower number 
of tokens. You'll want to test first but if you create a new keyspace for the 
new DC prior to creation of the new nodes with the desired RF (ie. a keyspace 
just in the "new" DC with your RF) then add your nodes using that keyspace for 
allocation tokens should be distributed evenly amongst that DC, and when 
migrate you can decommission the old DC and hopefully end up with a balanced 
cluster. Definitely test beforehand though because that was just me 
theorising... I'll note though that if your existing clusters don't have any 
major issues it's probably not worth the migration at this point. On Sat, 22 
Sep 2018 at 17:40, onmstester onmstester  wrote: I noticed 
that currently there is a discussion in ML with subject: changing default token 
behavior for 4.0. Any recommendation to guys like me who already have multiple 
clusters ( > 30 nodes in each cluster) with random partitioner and num_tokens = 
256? I should also add some nodes to existing clusters, is it possible with 
num_tokens = 256? How could we fix this bug (reduce num_tokens in existent 
clusters)? Cassandra version: 3.11.2 Sent using Zoho Mail

stuck with num_tokens 256

2018-09-22 Thread onmstester onmstester
I noticed that currently there is a discussion in ML with subject: changing 
default token behavior for 4.0. Any recommendation to guys like me who already 
have multiple clusters ( > 30 nodes in each cluster) with random partitioner 
and num_tokens = 256? I should also add some nodes to existing clusters, is it 
possible with num_tokens = 256? How could we fix this bug (reduce num_tokens in 
existent clusters)? Cassandra version: 3.11.2 Sent using Zoho Mail

Scale SASI index

2018-09-17 Thread onmstester onmstester
After adding new nodes to the cluster, should i rebuild the SASI indexes on all nodes?

Re: node replacement failed

2018-09-14 Thread onmstester onmstester
Thanks, I am still thinking about it, but before going deeper, is this still an 
issue for you at the moment? Yes, It is.

Re: node replacement failed

2018-09-10 Thread onmstester onmstester
e rack isolation guarantee will no longer be valid. It's 
hard to reason about what would happen to the data and in terms of streaming. * 
Alternatively, if you don't have enough space, you can even 'force' the 
'nodetool removenode'. See the documentation. Forcing it will prevent streaming 
and remove the node (token ranges handover, but not the data). If that does not 
work you can use the 'nodetool assassinate' command as well. When adding nodes 
back to the broken DC, the first nodes will take probably 100% of the 
ownership, which is often too much. You can consider adding back all the nodes 
with 'auto_bootstrap: false' before repairing them once they have their final 
token ownership, the same ways we do when building a new data center. This 
option is not really clean, and have some caveats that you need to consider 
before starting as there are token range movements and nodes available that do 
not have the data. Yet this should work. I imagine it would work nicely with 
RF=3 and QUORUM and with RF=2 (if you have 2+ racks), I guess it should work as 
well but you will have to pick one of availability or consistency while 
repairing the data. Be aware that read requests hitting these nodes will not 
find data! Plus, you are using an RF=2. Thus using consistency of 2+ (TWO, 
QUORUM, ALL), for at least one of reads or writes is needed to preserve 
consistency while re-adding the nodes in this case. Otherwise, reads will not 
detect the mismatch with certainty and might show inconsistent data the time 
for the nodes to be repaired. I must say, that I really prefer odd values for 
the RF, starting with RF=3. Using RF=2 you will have to pick. Consistency or 
Availability. With a consistency of ONE everywhere, the service is available, 
no single point of failure. using anything bigger than this, for writes or 
read, brings consistency but it creates single points of failures (actually any 
node becomes a point of failure). RF=3 and QUORUM for both write and reads take 
the best of the 2 worlds somehow. The tradeoff with RF=3 and quorum reads is 
the latency increase and the resource usage. Maybe is there a better approach, 
I am not too sure, but I think I would try option 1 first in any case. It's 
less destructive, less risky, no token range movements, no empty nodes 
available. I am not sure about limitation you might face though and that's why 
I suggest a second option for you to consider if the first is not actionable. 
Let us know how it goes, C*heers, --- Alain Rodriguez - 
@arodream - al...@thelastpickle.com France / Spain The Last Pickle - Apache 
Cassandra Consulting http://www.thelastpickle.com Le lun. 10 sept. 2018 à 
09:09, onmstester onmstester  a écrit : Any idea? Sent 
using Zoho Mail  On Sun, 09 Sep 2018 11:23:17 +0430 onmstester onmstester 
 wrote  Hi, Cluster Spec: 30 nodes RF = 2 
NetworkTopologyStrategy GossipingPropertyFileSnitch + rack aware Suddenly i 
lost all disks of cassandar-data on one of my racks, after replacing the disks, 
tried to replace the nodes with same ip using this: 
https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
 starting the to-be-replace-node fails with: java.lang.IllegalStateException: 
unable to find sufficient sources for streaming range in keyspace system_traces 
the problem is that i did not changed default replication config for System 
keyspaces, but Now when i altered the system_traces keyspace startegy to 
NetworkTopologyStrategy and RF=2 but then running nodetool repair failed: 
Endpoint not alive /IP of dead node that i'm trying to replace. What should i 
do now? Can i just remove previous nodes, change dead nodes IPs and re-join 
them to cluster? Sent using Zoho Mail

Re: node replacement failed

2018-09-10 Thread onmstester onmstester
Any idea? Sent using Zoho Mail  On Sun, 09 Sep 2018 11:23:17 +0430 
onmstester onmstester  wrote  Hi, Cluster Spec: 30 
nodes RF = 2 NetworkTopologyStrategy GossipingPropertyFileSnitch + rack aware 
Suddenly i lost all disks of cassandar-data on one of my racks, after replacing 
the disks, tried to replace the nodes with same ip using this: 
https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
 starting the to-be-replace-node fails with: java.lang.IllegalStateException: 
unable to find sufficient sources for streaming range in keyspace system_traces 
the problem is that i did not changed default replication config for System 
keyspaces, but Now when i altered the system_traces keyspace startegy to 
NetworkTopologyStrategy and RF=2 but then running nodetool repair failed: 
Endpoint not alive /IP of dead node that i'm trying to replace. What should i 
do now? Can i just remove previous nodes, change dead nodes IPs and re-join 
them to cluster? Sent using Zoho Mail

node replacement failed

2018-09-09 Thread onmstester onmstester
Hi, Cluster Spec: 30 nodes RF = 2 NetworkTopologyStrategy 
GossipingPropertyFileSnitch + rack aware Suddenly i lost all disks of 
cassandar-data on one of my racks, after replacing the disks, tried to replace 
the nodes with same ip using this: 
https://blog.alteroot.org/articles/2014-03-12/replace-a-dead-node-in-cassandra.html
 starting the to-be-replace-node fails with: java.lang.IllegalStateException: 
unable to find sufficient sources for streaming range in keyspace system_traces 
the problem is that i did not changed default replication config for System 
keyspaces, but Now when i altered the system_traces keyspace startegy to 
NetworkTopologyStrategy and RF=2 but then running nodetool repair failed: 
Endpoint not alive /IP of dead node that i'm trying to replace. What should i 
do now? Can i just remove previous nodes, change dead nodes IPs and re-join 
them to cluster? Sent using Zoho Mail

Re: [EXTERNAL] Re: adding multiple node to a cluster, cleanup and num_tokens

2018-09-08 Thread onmstester onmstester
Thanks Jeff, You mean that with RF=2, num_tokens = 256 and having less than 256 
nodes i should not worry about data distribution? Sent using Zoho Mail  On 
Sat, 08 Sep 2018 21:30:28 +0430 Jeff Jirsa  wrote  
Virtual nodes accomplish two primary goals 1) it makes it easier to gradually 
add/remove capacity to your cluster by distributing the new host capacity 
around the ring in smaller increments 2) it increases the number of sources for 
streaming, which speeds up bootstrap and decommission Whether or not either of 
these actually is true depends on a number of factors, like your cluster size 
(for #1) and your replication factor (for #2). If you have 4 hosts and 4 tokens 
per host and add a 5th host, you’ll probably add a neighbor near each existing 
host (#1) and stream from every other host (#2), so that’s great. If you have 
20 hosts and add a new host with 4 tokens, most of your existing ranges won’t 
change at all - you’re nominally adding 5% of your cluster capacity but you 
won’t see a 5% improvement because you don’t have enough tokens to move 5% of 
your ranges. If you had 32 tokens, you’d probably actually see that 5% 
improvement, because you’d likely add a new range near each of the existing 
ranges. Going down to 1 token would mean you’d probably need to manually move 
tokens after each bootstrap to rebalance, which is fine, it just takes more 
operator awareness. I don’t know how DSE calculates which replication factor to 
use for their token allocation logic, maybe they guess or take the highest or 
something. Cassandra doesn’t - we require you to be explicit, but we could 
probably do better here.
