C* clusters with node counts > 100

2020-12-10 Thread Reid Pinchback
About a year ago I remember a conversation about C* clusters with large
numbers of nodes.  I think Jon Haddad had raised the point that at > 100 nodes
you start to run into issues, something related to a thread pool with a
size proportional to the number of nodes, but that this problem would be
mitigated in C* 4.  However, that's about all I recall, and I haven't found
anything via Googling that talks about this concern.

I have a current project where I need to know the specifics a bit better.
In particular:

1. In a multi-DC cluster, is this size concern about the size of a DC, or
does it apply to the aggregate number of nodes in the entire cluster?

2. What specific misbehaviors manifest?  Does this show as a memory drain,
a latency increase, network congestion, etc?

3. How sharp is the cliff in terms of seeing that misbehavior?

4. Any pointer at the code artifact that causes this would be of interest.
We're using a mix of 3.7 and 3.11 in our clusters, so any git pointer
appropriate to that would be great.

Long story short, I'm planning out a bunch of upgrades, with details I
won't get into here, but spinning up a new DC matching a desired final
configuration looks to be the healthier path so long as I don't slam face
first into a problem related to node count while in the middle of it.

-- 
Reid M. Pinchback
Owner & CEO
CodeKami Consulting LLC


Re: Running Large Clusters in Production

2020-07-13 Thread Reid Pinchback
I don’t know if it’s the OP’s intent in this case, but the response latency 
profile will likely be different for two clusters equivalent in total storage 
but different in node count. There are multiple reasons for that, but probably the 
biggest would be that you’re changing a divisor in the I/O queuing statistics that 
matter to compaction-triggered dirty page flushes, and I’d expect you would see 
that in latencies.  Speculative retry stats to bounce past slow nodes busy with 
garbage collections might shift a bit too.
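
For illustration only, here's a rough back-of-envelope sketch in Python of that 
divisor effect; every number in it is made up (including the 64 MB stand-in for 
vm.dirty_background_bytes), it just shows how a fixed aggregate write load over 
fewer nodes reaches the kernel's background-flush threshold sooner on each node:

def per_node_write_rate_mb_s(aggregate_mb_s, nodes):
    # Assume flush/compaction write traffic spreads roughly evenly across nodes.
    return aggregate_mb_s / nodes

aggregate_mb_s = 800.0                     # hypothetical cluster-wide write rate
dirty_background_bytes = 64 * 1024 * 1024  # hypothetical vm.dirty_background_bytes

for nodes in (50, 100, 200):
    rate = per_node_write_rate_mb_s(aggregate_mb_s, nodes)
    seconds_to_threshold = dirty_background_bytes / (rate * 1024 * 1024)
    print(f"{nodes:>4} nodes: {rate:6.1f} MB/s per node, "
          f"background flush threshold reached in ~{seconds_to_threshold:.1f}s")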

R

From: "Durity, Sean R" 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, July 13, 2020 at 10:48 AM
To: "user@cassandra.apache.org" 
Subject: RE: Running Large Clusters in Production

Message from External Sender
I’m curious – is the scaling needed for the amount of data, the amount of user 
connections, throughput or what? I have a 200ish cluster, but it is primarily a 
disk space issue. When I can have (and administer) nodes with large disks, the 
cluster size will shrink.


Sean Durity

From: Isaac Reath (BLOOMBERG/ 919 3RD A) 
Sent: Monday, July 13, 2020 10:35 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Running Large Clusters in Production

Thanks for the info Jeff, all very helpful!
From: user@cassandra.apache.org At: 07/11/20 
12:30:36
To: user@cassandra.apache.org
Subject: Re: Running Large Clusters in Production

Gossip related stuff eventually becomes the issue

For example, when a new host joins the cluster (or replaces a failed host), the 
new bootstrapping tokens go into a “pending range” set. Writes then merge 
pending ranges with final ranges, and the data structures involved here weren’t 
necessarily designed for hundreds of thousands of ranges, so it’s likely they 
stop behaving at some point 
(https://issues.apache.org/jira/browse/CASSANDRA-6345 and 
https://issues.apache.org/jira/browse/CASSANDRA-6127 as examples, but there have 
been others)

Unrelated to vnodes, until Cassandra 4.0 the internode messaging requires 
basically 6 threads per instance - 3 for ingress and 3 for egress - to every 
other host in the cluster. The full mesh gets pretty expensive; it was 
rewritten in 4.0, and that thousand number may go up quite a bit after that.
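
As a rough back-of-envelope, assuming the 6-threads-per-peer figure above 
(actual counts vary by version and settings), the full mesh grows like this:

def internode_threads(cluster_size, threads_per_peer=6):
    # 3 ingress + 3 egress threads to every other host in the cluster (pre-4.0).
    return threads_per_peer * (cluster_size - 1)

for n in (100, 200, 500, 1000):
    print(f"{n:>5} nodes -> ~{internode_threads(n):>5} internode messaging threads per node")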



On Jul 11, 2020, at 9:16 AM, Isaac Reath (BLOOMBERG/ 919 3RD A) 
<ire...@bloomberg.net> wrote:
Thank you John and Jeff, I was leaning towards sharding and this really helps 
support that opinion. Would you mind explaining a bit more what about vnodes 
caused those issues?
From: user@cassandra.apache.org At: 07/10/20 
19:06:27
To: user@cassandra.apache.org
Cc: Isaac Reath (BLOOMBERG/ 919 3RD A )
Subject: Re: Running Large Clusters in Production

I worked on a handful of large clusters (> 200 nodes) using vnodes, and there 
were some serious issues with both performance and availability.  We had to put 
in a LOT of work to fix the problems.

I agree with Jeff - it's way better to manage multiple clusters than a really 
large one.


On Fri, Jul 10, 2020 at 2:49 PM Jeff Jirsa <jji...@gmail.com> wrote:
1000 instances are fine if you're not using vnodes.

I'm not sure what the limit is if you're using vnodes.

If you might get to 1000, shard early before you get there. Running 8x100 host 
clusters will be easier than one 800 host cluster.
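
For a rough sense of why vnodes make this worse, a small sketch of how the 
number of token ranges the cluster has to track scales with num_tokens (256 was 
the long-time vnode default, 16 the commonly recommended lower value; the 
figures are purely illustrative):

def token_ranges(nodes, num_tokens):
    # Each node owns num_tokens token ranges, so the ring has roughly this many.
    return nodes * num_tokens

for nodes in (100, 800, 1000):
    for num_tokens in (1, 16, 256):
        print(f"{nodes:>5} nodes x {num_tokens:>3} tokens/node = "
              f"{token_ranges(nodes, num_tokens):>7} ranges")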


On Fri, Jul 10, 2020 at 2:19 PM Isaac Reath (BLOOMBERG/ 919 3RD A) 
<ire...@bloomberg.net> wrote:
Hi All,

I’m currently dealing with a use case that is running on around 200 nodes. Due 
to growth of their product, as well as onboarding additional data sources, we 
are looking at having to expand that to around 700 nodes, and potentially 
beyond to 1000+. To that end I have a couple of questions:

1) For those who have experienced managing clusters at that scale, what types 
of operational challenges have you run into that you might not see when 
operating 100-node clusters? A couple that come to mind: version (especially 
major version) upgrades become a lot more risky, as it is no longer 
feasible to do a blue / green style deployment of the database, and backup & 
restore operations seem far more error prone as well for the same reason 
(having to do an in-place restore instead of being able to spin up a new 
cluster to restore to).

2) Is there a cluster size beyond which sharding across multiple clusters 
becomes the recommended approach?

Thanks,
Isaac







Re: Corrupt sstables_activity

2020-07-02 Thread Reid Pinchback
Here’s an article link for repairing table corruption, something I’d saved back 
last year in case I ever needed it:

https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/

Hope it helps.

R

From: F 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, July 2, 2020 at 12:50 PM
To: "user@cassandra.apache.org" 
Subject: Corrupt sstables_activity

Message from External Sender
Good afternoon! I have a question related to a system level keyspace.

Problem: While running a routine full repair on a specific keyspace and table 
where I had to remove one of the big data portions due to corruption (sstablescrub 
failed), the system.log indicated that the specific keyspace and table repaired 
successfully. Nevertheless, the system.log also indicated that one of the big 
data files related to system.sstable_activity was corrupted.

I tried running nodetool scrub on system.sstable_activity, but it 
tombstoned 0 rows and still states it is corrupted. I also verified, via cqlsh, 
that the specific node's sstable_activity is not queryable either.

Because the system keyspace is local, I don't think I can repair it with 
nodetool repair. What would be the steps involved to correct this table?

Version of Apache Cassandra: 3.9 ( its old )

Current ideas I have not yet tried:

1.) Stop Cassandra manually. Remove the offending .db file throwing the 
corruption error. Restart Cassandra. See if Cassandra will rebuild the table 
automatically.

2.) As (1) above, but remove the entire folder instead of the specific .db file.

3.) Drop the node from the cluster (I would like to avoid this; some data is RF 
1, and I still need to complete other full repairs).

4.) sstablescrub on system.sstable_activity? (I don't believe this will do 
anything because RF is still local.)



Any suggestions moving forward would be appreciated! Thanks!


Re: Can cassandra pick configuration from environment variables

2020-06-29 Thread Reid Pinchback
It’s pretty easy to use Ansible, or Python with Jinja by itself if you don’t 
use Ansible, to templatize your config file so the environment variables 
get substituted.
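
A minimal sketch of that approach using Python and Jinja2; the template 
fragment, variable names, and output path are just illustrative, and a real 
cassandra.yaml has many more settings:

import os
from jinja2 import Template

# Hypothetical fragment of a cassandra.yaml template; real files are much larger.
template_text = """\
cluster_name: {{ CLUSTER_NAME }}
listen_address: {{ LISTEN_ADDRESS }}
seed_provider:
  - class_name: org.apache.cassandra.locator.SimpleSeedProvider
    parameters:
      - seeds: "{{ SEEDS }}"
"""

rendered = Template(template_text).render(
    CLUSTER_NAME=os.environ.get("CLUSTER_NAME", "test_cluster"),
    LISTEN_ADDRESS=os.environ.get("LISTEN_ADDRESS", "127.0.0.1"),
    SEEDS=os.environ.get("SEEDS", "127.0.0.1"),
)

with open("cassandra.yaml", "w") as f:
    f.write(rendered)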

From: Jeff Jirsa 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, June 29, 2020 at 10:36 AM
To: cassandra 
Subject: Re: Can cassandra pick configuration from environment variables

Message from External Sender
You can probably implement a custom config loader that pulls all the config 
from env vars, if you're so inclined (a bit of java, the interface has a single 
method, maybe one or two hooks into the db, which may be suitable for 
committing for general purpose use).

On Mon, Jun 29, 2020 at 4:35 AM Angelo Polo <language.de...@gmail.com> wrote:
You can, however, set the environment variable CASSANDRA_CONF to direct the 
startup script to the configuration directory that holds cassandra.yaml, 
cassandra-env.sh, etc. So while you can't set individual C* configuration 
parameters from environment variables, you could have different configuration 
directories (can think of them as different profiles) and specify at startup 
which to use.
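
A small sketch of that profile idea; the directory layout and the 
CASSANDRA_PROFILE variable are hypothetical, only CASSANDRA_CONF is the real 
knob described above:

import os
import subprocess

# Pick a config directory per environment and point CASSANDRA_CONF at it.
profile = os.environ.get("CASSANDRA_PROFILE", "dev")
env = dict(os.environ, CASSANDRA_CONF=f"/etc/cassandra/profiles/{profile}")

# Start Cassandra in the foreground with the chosen configuration profile.
subprocess.run(["/usr/local/cassandra/bin/cassandra", "-f"], env=env, check=True)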

Best,
Angelo Polo

On Sun, Jun 28, 2020 at 2:28 PM Erick Ramirez <erick.rami...@datastax.com> wrote:
You can't. You can only configure Cassandra by setting the properties in the 
cassandra.yaml file. Cheers!


Re: Encryption at rest

2020-06-25 Thread Reid Pinchback
If you’re using AWS with EBS then you can just handle that with KMS to encrypt 
the volumes.  If you’re using local storage on EC2, or you aren’t on AWS, then 
you’ll have to do heavier lifting with luks and dm-crypt, or eCryptfs, etc.  If 
you’re using a container mechanism for your C* deployments, you might prefer 
options that encrypt based on directory hierarchies instead of block storage or 
filesystems, if you want some security isolation between co-tenants on a box.  
I was trying to jog my memory on the current state of the art and hit a decent 
summary on the Arch Linux site that you may wish to eyeball:

https://wiki.archlinux.org/index.php/Data-at-rest_encryption


From: Arvinder Dhillon 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, June 25, 2020 at 1:12 AM
To: "user@cassandra.apache.org" 
Subject: Re: Encryption at rest

Message from External Sender
Do it at storage level.


On Wed, Jun 24, 2020, 1:01 PM Jeff Jirsa <jji...@gmail.com> wrote:
Not really, no.


On Wed, Jun 24, 2020 at 1:00 PM Abdul Patel <abd786...@gmail.com> wrote:
Team,

Do we have option in open source to do encryption at rest in cassandra ?


Re: Memory decline

2020-06-18 Thread Reid Pinchback
Just to confirm, is this memory decline outside of the Cassandra process?  If 
so, I’d look at crond and at memory held for network traffic.  Those are the 
two areas I’ve seen leak.  If you’ve configured to have swap=0, then you end up 
in a position where even if the memory usage is stale, nothing can push the 
stale pages out of the way.
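
If it helps narrow that down, a simple sampler along these lines (a sketch 
only; the fields and interval are arbitrary choices) can show whether the 
decline tracks slab, dirty pages, or swap rather than the Cassandra process 
itself:

import time

FIELDS = ("MemAvailable", "Slab", "Dirty", "SwapFree")

def read_meminfo():
    # Pull a few counters (in kB) out of /proc/meminfo on Linux.
    values = {}
    with open("/proc/meminfo") as f:
        for line in f:
            key, rest = line.split(":", 1)
            if key in FIELDS:
                values[key] = int(rest.strip().split()[0])
    return values

# Print a snapshot once a minute; stop with Ctrl-C.
while True:
    snapshot = read_meminfo()
    print(time.strftime("%H:%M:%S"),
          " ".join(f"{k}={v}kB" for k, v in snapshot.items()))
    time.sleep(60)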

From: Rahul Reddy 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, June 18, 2020 at 10:27 AM
To: "user@cassandra.apache.org" 
Subject: Memory decline

Message from External Sender
Hello,

I'm seeing a continuous decline in memory on a Cassandra instance. It used to have 20g 
free memory 15 days back, now it's 15g, and it continues to go down. The same instance 
crashed before because of this. Can you please give me some 
pointers on what to look for that could be causing the continuous decline in memory.


Re: Cassandra crashes when using offheap_objects for memtable_allocation_type

2020-06-02 Thread Reid Pinchback
I’d also take a look at the O/S level.  You might be queued up on flushing of 
dirty pages, which would also throttle your ability to write mempages.  Once 
the I/O gets throttled badly, I’ve seen it push back into what you see in C*. 
To Aaron’s point, you want a balance in memory between C* and O/S buffer cache, 
because to write to disk you pass through buffer cache first.
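
A read-only sketch of where to look at the O/S level for that; these are 
standard Linux /proc paths, and nothing here is tuning advice:

def read_value(path):
    with open(path) as f:
        return f.read().strip()

# Kernel writeback thresholds that govern when dirty pages get flushed.
for knob in ("dirty_background_ratio", "dirty_ratio",
             "dirty_expire_centisecs", "dirty_writeback_centisecs"):
    path = f"/proc/sys/vm/{knob}"
    print(f"vm.{knob} = {read_value(path)}")

# Current amount of dirty / writeback memory.
with open("/proc/meminfo") as f:
    for line in f:
        if line.startswith(("Dirty:", "Writeback:")):
            print(line.strip())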

From: Aaron Ploetz 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, June 2, 2020 at 9:38 AM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra crashes when using offheap_objects for 
memtable_allocation_type

Message from External Sender
I would try running it with memtable_offheap_space_in_mb at the default for 
sure, but definitely lower than 8GB.  With 32GB of RAM, you're already 
allocating half of that for your heap, and then halving the remainder for off 
heap memtables.  What's left may not be enough for the OS, etc.  Giving some of 
that back, will allow more to be used for page cache, which always helps.

"JVM heap size: 16GB, CMS, 1GB newgen"

For CMS GC with a 16GB heap, 1GB is way too small for new gen.  You're going to 
want that to be at least 40% of the max heap size.  Some folks here even 
advocate for setting Xmn as high as 50% of Xmx.
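
The arithmetic behind that suggestion, as a quick sanity check (a sketch only; 
the 40-50% range is the rule of thumb above, not a hard rule):

heap_gb = 16                      # the -Xmx from the config quoted below
newgen_low_gb = 0.40 * heap_gb    # "at least 40% of the max heap size"
newgen_high_gb = 0.50 * heap_gb   # some advocate as high as 50% of Xmx
print(f"-Xmx{heap_gb}G with CMS: consider -Xmn between ~{newgen_low_gb:.1f}G "
      f"and {newgen_high_gb:.1f}G (versus the 1G currently configured)")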

If you want to stick with CMS GC, take a look at 
https://issues.apache.org/jira/browse/CASSANDRA-8150.
  There's plenty of good info in there on CMS GC tuning.  Make sure to read 
through the whole ticket, so that you understand what each setting does.  You 
can't just pick-and-choose.

Regards,

Aaron


On Tue, Jun 2, 2020 at 1:31 AM onmstester onmstester wrote:
I just changed these properties to increase flushed file size (decrease number 
of compactions):

  *   memtable_allocation_type from heap_buffers to offheap_objects
  *   memtable_offheap_space_in_mb: from default (2048) to 8192
Using default value for other memtable/compaction/commitlog configurations .

After a few hours some of the nodes stopped doing any mutations (dropped mutations 
increased) and pending flushes increased; they were just up and running, 
and there was only a single CPU core with 100% usage (other cores were at 0%). Other 
nodes in the cluster marked the node as DN. I could not access 7199 and also 
could not create a thread dump, even with jstack -F.

Restarting the Cassandra service fixes the problem, but after a while some other 
node would go DN.

Am I missing some configurations?  What should I change in the Cassandra default 
configuration to maximize write throughput for a single node/cluster in a 
write-heavy scenario for the data model below?
The data model is a single table:
  create table test(
  text partition_key,
  text clustering_key,
  set rows,
  primary key ((partition_key, clustering_key))


vCPU: 12
Memory: 32GB
Node data size: 2TB
Apache cassandra 3.11.2
JVM heap size: 16GB, CMS, 1GB newgen


Sent using Zoho 
Mail





Re: Cassandra Bootstrap Sequence

2020-06-02 Thread Reid Pinchback
Would updating disk boundaries be sensitive to disk I/O tuning?  I’m 
remembering Jon Haddad’s talk about typical throughput problems in disk page 
sizing.

From: Jai Bheemsen Rao Dhanwada 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, June 2, 2020 at 10:48 AM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra Bootstrap Sequence

Message from External Sender
3000 tables

On Tuesday, June 2, 2020, Durity, Sean R <sean_r_dur...@homedepot.com> wrote:
How many total tables in the cluster?


Sean Durity

From: Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com>
Sent: Monday, June 1, 2020 8:36 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Cassandra Bootstrap Sequence

Thanks Erick,

I see below tasks are being run mostly. I didn't quite understand what exactly 
these scheduled tasks are for? Is there a way to reduce the boot-up time or do 
I have to live with this delay?

$ zgrep "CompactionStrategyManager.java:380 - Recreating compaction strategy" 
debug.log*  | wc -l
3249
$ zgrep "DiskBoundaryManager.java:53 - Refreshing disk boundary cache for" 
debug.log*  | wc -l
6293
$ zgrep "DiskBoundaryManager.java:92 - Got local ranges" debug.log*  | wc -l
6308
$ zgrep "DiskBoundaryManager.java:56 - Updating boundaries from DiskBoundaries" 
debug.log*  | wc -l
3249





On Mon, Jun 1, 2020 at 5:01 PM Erick Ramirez <erick.rami...@datastax.com> wrote:
There's quite a lot of steps that takes place during the startup sequence 
between these 2 lines:

INFO  [main] 2020-05-31 23:51:15,555 Gossiper.java:1723 - No gossip backlog; 
proceeding
INFO  [main] 2020-05-31 23:54:06,867 NativeTransportService.java:70 - Netty 
using native Epoll event loop

For the most part, it's taken up by CompactionStrategyManager and 
DiskBoundaryManager. If you check debug.log, you'll see that it's mostly 
updating disk boundaries. The length of time it takes is proportional to the 
number of tables in the cluster.

Have a look at this section [1] of CassandraDaemon if you're interested in the 
details of the startup sequence. Cheers!

[1] 
https://github.com/apache/cassandra/blob/cassandra-3.11.3/src/java/org/apache/cassandra/service/CassandraDaemon.java#L399-L435





Re: Cassandra Bootstrap Sequence

2020-06-01 Thread Reid Pinchback
The thing to look for in GC logs would be signs that you’re bouncing against 
your memory limits and spending a lot of time in full GC collections.

I’m not sure at what phase it kicks in, but there is definitely the potential 
for memory issues when you have large column families (large in the number of 
columns, I mean), and your mention that the situation gets worse in 
proportion to the number of tables brought GC to mind.  Not sure about the 
proportion to nodes; I think there are thread counts that increase with the 
number of nodes, and increased threads can also add to GC load, particularly in 
G1GC.

I’m speculating a bit on possible causes, but basically the idea was to look 
for GC load during those 3 minutes, because if you see it then you’re not 
hunting for a timeout tuning or anything like that, you’re hunting for a 
resource allocation tuning.

From: Jai Bheemsen Rao Dhanwada 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, June 1, 2020 at 7:15 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra Bootstrap Sequence

Message from External Sender
Is there anything specific to look for in GC logs?
BTW, this delay always happens whenever I bootstrap a node or restart a C* 
process.

I don't believe it's a GC issue and correction from initial question, it's not 
just bootstrap, but every restart of C* process is causing this.

On Mon, Jun 1, 2020 at 3:22 PM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
That gap seems a long time.  Have you checked GC logs around the timeframe?

From: Jai Bheemsen Rao Dhanwada <jaibheem...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Monday, June 1, 2020 at 3:52 PM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Cassandra Bootstrap Sequence

Message from External Sender
Hello Team,

When I am bootstrapping/restarting a Cassandra Node, there is a delay between 
gossip settle and port opening. Can someone please explain me where this delay 
is configured and can this be changed? I don't see any information in the logs

In my case if you see there is  a ~3 minutes delay and this increases if I 
increase the #of tables and #of nodes and DC.

INFO  [main] 2020-05-31 23:51:07,554 Gossiper.java:1692 - Waiting for gossip to 
settle...
INFO  [main] 2020-05-31 23:51:15,555 Gossiper.java:1723 - No gossip backlog; 
proceeding
INFO  [main] 2020-05-31 23:54:06,867 NativeTransportService.java:70 - Netty 
using native Epoll event loop
INFO  [main] 2020-05-31 23:54:06,913 Server.java:155 - Using Netty Version: 
[netty-buffer=netty-buffer-4.0.44.Final.452812a, 
netty-codec=netty-codec-4.0.44.Final.452812a, 
netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a, 
netty-codec-http=netty-codec-http-4.0.44.Final.452812a, 
netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a, 
netty-common=netty-common-4.0.44.Final.452812a, 
netty-handler=netty-handler-4.0.44.Final.452812a, 
netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb, 
netty-transport=netty-transport-4.0.44.Final.452812a, 
netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a, 
netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a, 
netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a, 
netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
INFO  [main] 2020-05-31 23:54:06,913 Server.java:156 - Starting listening for 
CQL clients on /x.x.x.x:9042 (encrypted)...

Also during this 3 minutes delay, I am losing all my metrics from the C* 
nodes(basically the metrics are not returned within 10s).

Can someone please help me understand the delay here?

Cassandra Version: 3.11.3
Metrics: Using telegraf to collect metrics.


Re: Cassandra Bootstrap Sequence

2020-06-01 Thread Reid Pinchback
That gap seems a long time.  Have you checked GC logs around the timeframe?

From: Jai Bheemsen Rao Dhanwada 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, June 1, 2020 at 3:52 PM
To: "user@cassandra.apache.org" 
Subject: Cassandra Bootstrap Sequence

Message from External Sender
Hello Team,

When I am bootstrapping/restarting a Cassandra Node, there is a delay between 
gossip settle and port opening. Can someone please explain me where this delay 
is configured and can this be changed? I don't see any information in the logs

In my case if you see there is  a ~3 minutes delay and this increases if I 
increase the #of tables and #of nodes and DC.

INFO  [main] 2020-05-31 23:51:07,554 Gossiper.java:1692 - Waiting for gossip to 
settle...
INFO  [main] 2020-05-31 23:51:15,555 Gossiper.java:1723 - No gossip backlog; 
proceeding
INFO  [main] 2020-05-31 23:54:06,867 NativeTransportService.java:70 - Netty 
using native Epoll event loop
INFO  [main] 2020-05-31 23:54:06,913 Server.java:155 - Using Netty Version: 
[netty-buffer=netty-buffer-4.0.44.Final.452812a, 
netty-codec=netty-codec-4.0.44.Final.452812a, 
netty-codec-haproxy=netty-codec-haproxy-4.0.44.Final.452812a, 
netty-codec-http=netty-codec-http-4.0.44.Final.452812a, 
netty-codec-socks=netty-codec-socks-4.0.44.Final.452812a, 
netty-common=netty-common-4.0.44.Final.452812a, 
netty-handler=netty-handler-4.0.44.Final.452812a, 
netty-tcnative=netty-tcnative-1.1.33.Fork26.142ecbb, 
netty-transport=netty-transport-4.0.44.Final.452812a, 
netty-transport-native-epoll=netty-transport-native-epoll-4.0.44.Final.452812a, 
netty-transport-rxtx=netty-transport-rxtx-4.0.44.Final.452812a, 
netty-transport-sctp=netty-transport-sctp-4.0.44.Final.452812a, 
netty-transport-udt=netty-transport-udt-4.0.44.Final.452812a]
INFO  [main] 2020-05-31 23:54:06,913 Server.java:156 - Starting listening for 
CQL clients on /x.x.x.x:9042 (encrypted)...

Also during this 3 minutes delay, I am losing all my metrics from the C* 
nodes(basically the metrics are not returned within 10s).

Can someone please help me understand the delay here?

Cassandra Version: 3.11.3
Metrics: Using telegraf to collect metrics.


Re: any risks with changing replication factor on live production cluster without downtime and service interruption?

2020-05-26 Thread Reid Pinchback
By retry logic, I’m going to guess you are doing some kind of version 
consistency trick where you have a non-key column managing a visibility horizon 
to simulate a transaction, and you poll for a horizon value >= some threshold 
that the app is keeping aware of.

Note that these assorted variations on trying to do battle with eventual 
consistency can generate a lot of load on the cluster, unless there is enough 
latency in the progress of the logical flow at the app level that the 
optimistic concurrency hack almost always succeeds the first time anyways.

If this generates the degree of java garbage collection that I suspect, then 
the advice to upgrade C* becomes even more significant.  Repairs themselves can 
generate substantial memory load, and you could have a node or two drop out on 
you if they OOM. I’d definitely take Jeff’s advice about switching your reads 
to LOCAL_QUORUM until you’re done to buffer yourself from that risk.


From: Leena Ghatpande 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, May 26, 2020 at 1:20 PM
To: "user@cassandra.apache.org" 
Subject: Re: any risks with changing replication factor on live production 
cluster without downtime and service interruption?

Message from External Sender
Thank you for the response. Will follow the recommendation for the update. So 
with read=LOCAL_QUORUM we should see some latency, but not failures, during the RF 
change, right?

We do mitigate the issue of not seeing writes when set to LOCAL_ONE by having 
retry logic in the app.



From: Leena Ghatpande 
Sent: Friday, May 22, 2020 11:51 AM
To: cassandra cassandra 
Subject: any risks with changing replication factor on live production cluster 
without downtime and service interruption?

We are on Cassandra 3.7 and have a 12 node cluster , 2DC, with 6 nodes in each 
DC. RF=3
We have around 150M rows across tables.

We are planning to add more nodes to the cluster, and thinking of changing the 
replication factor to 5 for each DC.

Our application uses the below consistency level
 read-level: LOCAL_ONE
 write-level: LOCAL_QUORUM

If we change to RF=5 on the live cluster and run full repairs, would we see 
read/write errors while data is being replicated?
If so, this is not something that we can afford in production, so how would we 
avoid it?


Re: Bootstraping is failing

2020-05-11 Thread Reid Pinchback
If you’re correct that the issue you linked to is the bug you are hitting, then 
it was fixed in 3.11.3.  You may have no choice but to upgrade.  From the 
discussion it doesn’t read as if any tuning tweaks avoided the issue, just the 
patch fixed it.

If you do, I’d suggest going to at least 3.11.5.

Note that usable memory for a setting > 31 gb may not be what you think. At 
32gb you cross a boundary that triggers object pointers to double in size.  The 
only way you really win is when an app has only a modest number of objects, but 
some of those objects have large non-object-granularity allocations, e.g. like 
a few huge byte arrays.  C* does use some large buffers, but it also generates 
a lot of small objects.

I’d consider TCP tunings a likely red herring in this, if you are correct about 
the leak.  Doesn’t mean you can’t have better settings per suggestions made, 
just that it seems like it could be a case of refining behavior on the 
periphery of the problem, not anything directly addressing it.


From: Surbhi Gupta 
Reply-To: "user@cassandra.apache.org" 
Date: Saturday, May 9, 2020 at 11:51 AM
To: "user@cassandra.apache.org" 
Subject: Re: Bootstraping is failing

Message from External Sender
I tried changing the heap size from 31GB to 62GB on the bootstrapping node 
because I noticed that, when it reached the midway point of bootstrapping, heap 
usage reached around 90% or more and the node just froze.
But it is still the same behavior: it again reached midway, heap again 
reached 90% or more, the node just froze, and none of the nodetool commands 
returned output. Other nodes also removed this node from joining as they 
were not able to gossip with it.
We are on 3.11.0.

I took a heap dump when the node had 90%+ heap utilization of the 62GB heap 
size, opened the leak report, and found 3 leak suspects; 2 of the three 
were as below:

1. The thread io.netty.util.concurrent.FastThreadLocalThread @ 0x7fbe9533bf98 
StreamReceiveTask:26 keeps local variables with total size 16,898,023,552 
(31.10%)bytes.
The memory is accumulated in one instance of 
"io.netty.util.Recycler$DefaultHandle[]" loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x7fb917c76dc8".

2. The thread io.netty.util.concurrent.FastThreadLocalThread @ 0x7fbb846fb800 
StreamReceiveTask:29 keeps local variables with total size 11,696,214,424 
(21.53%)bytes.
The memory is accumulated in one instance of 
"io.netty.util.Recycler$DefaultHandle[]" loaded by 
"sun.misc.Launcher$AppClassLoader @ 0x7fb917c76dc8".

Am I getting hit by 
https://issues.apache.org/jira/browse/CASSANDRA-13929

I haven't changed the TCP settings. My TCP settings are more than recommended; 
what I wanted to understand is how TCP settings can affect the bootstrapping 
process.

Thanks
Surbhi

On Thu, 7 May 2020 at 17:01, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
When we start the node, it starts bootstrap automatically and restreams all 
the data again.  It is not resuming.

On Thu, May 7, 2020 at 4:47 PM Adam Scott <adam.c.sc...@gmail.com> wrote:
I think you want to run `nodetool bootstrap resume` 
(https://cassandra.apache.org/doc/latest/tools/nodetool/bootstrap.html) 
to pick up where it last left off. Sorry for the late reply.


On Thu, May 7, 2020 at 2:22 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
So after failed bootstrapped , if we start cassandra again on the new node , 
will it resume bootstrap or will it start over?

On Thu, 7 May 2020 at 13:32, Adam Scott <adam.c.sc...@gmail.com> wrote:
I recommend it on all nodes.  This will eliminate that as a source of trouble 
further on down the road.


On Thu, May 7, 2020 at 1:30 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
streaming_socket_timeout_in_ms is 24 hours.
So should the TCP settings be changed on the new bootstrapping node or on all nodes?


On Thu, 7 May 2020 at 13:23, Adam Scott <adam.c.sc...@gmail.com> wrote:

edit /etc/sysctl.conf


net.ipv4.tcp_keepalive_time=60

net.ipv4.tcp_keepalive_probes=3

net.ipv4.tcp_keepalive_intvl=10
then run sysctl -p to cause the kernel to reload the settings

5 minutes (300 seconds) is probably too long.

On Thu, May 7, 2020 at 1:09 PM Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:

[root@abc cassandra]# cat /proc/sys/net/ipv4/tcp_keepalive_time

300

[root@abc cassandra]# cat 

Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

2020-04-22 Thread Reid Pinchback
If the memory wasn’t being used, and it got pushed to swap, then the right 
thing happened.  It’s a common misconception that swap is bad.  The use of swap 
isn’t bad.  What is bad is if you find data churning in and out of swap space a 
lot so that your latency increases either due to the page faults or due to 
contention between swap activity and other disk I/O.  For the case it sounds 
like we’ve been discussing, where the buffers aren’t in use, basically all that 
would happen is that memory garbage would be shoved out of the way.  Honestly 
the thought I’d had in mind when you first described this would be to 
intentionally use cgroups to twiddle swappiness so that a short-term co-tenant 
load could be prioritized and shove stale C* memory out of the way, then 
twiddle the settings back when you prefer C* to be the winner in resource 
demand.

From: manish khandelwal 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, April 22, 2020 at 7:23 AM
To: "user@cassandra.apache.org" 
Subject: Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

Message from External Sender
I am running spark (max heap 4G) and a java application (4G) with my Cassandra 
server (8G).

After heavy loading, if I run a spark process, some main memory is pushed into 
swap. But if I restart Cassandra and then execute the spark process, memory is not 
pushed into swap.

The idea behind asking the above question was: is -XX:MaxDirectMemorySize the 
right knob to use to contain the off-heap memory? I understand that I have to 
test, as Erick said, since I might get an OutOfMemoryError. Or are there any 
other better options available for handling such situations?



On Tue, Apr 21, 2020 at 9:52 PM Reid Pinchback <rpinchb...@tripadvisor.com> wrote:
Note that from a performance standpoint, it’s hard to see a reason to care 
about releasing the memory unless you are co-tenanting C* with something else 
that’s significant in its memory demands, and significant on a schedule 
anti-correlated with when C* needs that memory.

If you aren’t doing that, then conceivably the only other time you’d care is if 
you are seeing read or write stalls on disk I/O because the O/S buffer cache is too 
small.  But if you were getting a lot of impact from stalls, then it would mean 
C* was very busy… and if it’s very busy then it’s likely using its buffers as 
they are intended.

From: HImanshu Sharma <himanshusharma0...@gmail.com>
Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Date: Saturday, April 18, 2020 at 2:06 AM
To: "user@cassandra.apache.org" <user@cassandra.apache.org>
Subject: Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

Message from External Sender
From the codebase, as much as I understood, once a buffer is allocated 
it is not freed but added to a recyclable pool. When a new request comes, an 
effort is made to fetch memory from the recyclable pool, and if it is not available a new 
allocation request is made. And while allocating a new request, if the memory limit 
is breached, then we get this OOM error.

I would like to know if my understanding is correct.
If what I am thinking is correct, is there a way we can get this buffer pool 
reduced when there is low traffic? Because what I have observed in my system is that 
this memory remains static even if there is no traffic.

Regards
Manish

On Sat, Apr 18, 2020 at 11:13 AM Erick Ramirez <erick.rami...@datastax.com> wrote:
Like most things, it depends on (a) what you're allowing and (b) how much your 
nodes require. MaxDirectMemorySize is the upper-bound for off-heap memory used 
for the direct byte buffer. C* uses it for Netty so if your nodes are busy 
servicing requests, they'd have more IO threads consuming memory.

During low traffic periods, there's less memory allocated to service requests 
and they eventually get freed up by GC tasks. But if traffic volumes are high, 
memory doesn't get freed up quick enough so the max is reached. When this 
happens, you'll see OOMs like "OutOfMemoryError: Direct buffer memory" show up 
in the logs.

You can play around with different values but make sure you test it 
exhaustively before trying it out in production. Cheers!

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on https://community.datastax.com/.


Re: Issues, understanding how CQL works

2020-04-21 Thread Reid Pinchback
Marc, have you had any exposure to DynamoDB at all?  The API approach is 
different, but the fundamental concepts are similar.  That’s actually a better 
reference point to have than an RDBMS, because really it’s a small subset of 
usage patterns that would overlap with CQL.  If you were, for example, dealing 
with databases that did a lot of table partitions and supported apps that 
focused bulk loads and analytics on a partition level, then you would be in a 
space somewhat similar to C*.

C* is at its best when your common usage pattern, at least on reads, is 
effectively “I want a bunch of stuff, so you may as well give it to me by the 
bunch… what I do with the bunch after is my problem”.  That’s very different 
from an RDBMS, which historically has always tried to find some balance between 
minimizing disk I/O and network I/O… but if it takes developers a lot more head 
scratching to get there, it was considered an acceptable investment to help 
scale the usage of an expensive resource.

As a result, language features for the two cases are quite different.

From: Elliott Sims 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, April 21, 2020 at 12:13 PM
To: "user@cassandra.apache.org" 
Subject: Re: Issues, understanding how CQL works

Message from External Sender
The short answer is that CQL isn't SQL.  It looks a bit like it, but the 
structure of the data is totally different.  Essentially (ignoring secondary 
indexes, which have some issues in practice and I think are generally not 
recommended) the only way to look the data up is by the partition key.  
Anything else is a full-table scan and if you need more querying flexibility 
Cassandra is probably not your best option.   With only 260GB, I think I'd lean 
towards suggesting PostgreSQL or MySQL.

On Tue, Apr 21, 2020 at 7:20 AM Marc Richter <m...@marc-richter.info> wrote:
Hi everyone,

I'm very new to Cassandra. I have, however, some experience with SQL.

I need to extract some information from a Cassandra database that has
the following table definition:

CREATE TABLE tagdata.central (
signalid int,
monthyear int,
fromtime bigint,
totime bigint,
avg decimal,
insertdate bigint,
max decimal,
min decimal,
readings text,
PRIMARY KEY (( signalid, monthyear ), fromtime, totime)
)

The database is already of round about 260 GB in size.
I now need to know what is the most recent entry in it; the correct
column to learn this would be "insertdate".

In SQL I would do something like this:

SELECT insertdate FROM tagdata.central
ORDER BY insertdate DESC LIMIT 1;

In CQL, however, I just can't get it to work.

What I have tried already is this:

SELECT insertdate FROM "tagdata.central"
ORDER BY insertdate DESC LIMIT 1;

But this gives me an error:
ERROR: ORDER BY is only supported when the partition key is restricted
by an EQ or an IN.

So, after some trial and error and a lot of Googling, I learned that I
must include all rows from the PRIMARY KEY from left to right in my
query. Thus, this is the "best" I can get to work:


SELECT
*
FROM
"tagdata.central"
WHERE
"signalid" = 4002
AND "monthyear" = 201908
ORDER BY
"fromtime" DESC
LIMIT 10;


The "monthyear" column, I crafted like a fool by incrementing the date
one month after another until no results could be found anymore.
The "signalid" I grabbed from one of the unrestricted "SELECT * FROM" -
query results. But these can't be as easily guessed as the "monthyear"
values could.

This is where I'm stuck!

1. This does not really feel like the ideal way to go. I think there is
something more mature in modern IT systems. Can anyone tell me what is a
better way to get these informations?

2. I need a way to learn all values that are in the "monthyear" and
"signalid" columns in order to be able to craft that query.
How can I achieve that in a reasonable way? As I said: The DB is round
about 260 GB which makes it next to impossible to just "have a look" at
the output of "SELECT *"..

Thanks for your help!

Best regards,
Marc Richter


-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 
user-h...@cassandra.apache.org


Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

2020-04-21 Thread Reid Pinchback
Note that from a performance standpoint, it’s hard to see a reason to care 
about releasing the memory unless you are co-tenanting C* with something else 
that’s significant in its memory demands, and significant on a schedule 
anti-correlated with when C* needs that memory.

If you aren’t doing that, then conceivably the only other time you’d care is if 
you are seeing read or write stalls on disk I/O because the O/S buffer cache is too 
small.  But if you were getting a lot of impact from stalls, then it would mean 
C* was very busy… and if it’s very busy then it’s likely using its buffers as 
they are intended.

From: HImanshu Sharma 
Reply-To: "user@cassandra.apache.org" 
Date: Saturday, April 18, 2020 at 2:06 AM
To: "user@cassandra.apache.org" 
Subject: Re: Impact of setting low value for flag -XX:MaxDirectMemorySize

Message from External Sender
From the codebase, as much as I understood, once a buffer is allocated 
it is not freed but added to a recyclable pool. When a new request comes, an 
effort is made to fetch memory from the recyclable pool, and if it is not available a new 
allocation request is made. And while allocating a new request, if the memory limit 
is breached, then we get this OOM error.

I would like to know if my understanding is correct.
If what I am thinking is correct, is there a way we can get this buffer pool 
reduced when there is low traffic? Because what I have observed in my system is that 
this memory remains static even if there is no traffic.

Regards
Manish

On Sat, Apr 18, 2020 at 11:13 AM Erick Ramirez <erick.rami...@datastax.com> wrote:
Like most things, it depends on (a) what you're allowing and (b) how much your 
nodes require. MaxDirectMemorySize is the upper-bound for off-heap memory used 
for the direct byte buffer. C* uses it for Netty so if your nodes are busy 
servicing requests, they'd have more IO threads consuming memory.

During low traffic periods, there's less memory allocated to service requests 
and they eventually get freed up by GC tasks. But if traffic volumes are high, 
memory doesn't get freed up quick enough so the max is reached. When this 
happens, you'll see OOMs like "OutOfMemoryError: Direct buffer memory" show up 
in the logs.

You can play around with different values but make sure you test it 
exhaustively before trying it out in production. Cheers!

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on 
https://community.datastax.com/.


Re: Cassandra node JVM hang during node repair a table with materialized view

2020-04-17 Thread Reid Pinchback
I would pay attention to the dirty background writer activity at the O/S level. 
 If you see that it isn’t keeping up with flushing changes to disk, then you’ll 
be in an even worse situation as you increase the JVM heap size, because that 
will be done at the cost of the size of available buffer cache.  When Linux 
can’t flush to disk, it can manifest as malloc failures (although if your C* is 
configured to have the JVM pre-touch all memory allocations, that shouldn’t 
happen… I don’t know if C* versions as old as yours do that, current ones 
definitely are configured that way).

If you get stuck, you may want to consider upgrading to something recent in the 
3.11 versions, 3.11.5 or newer.  A setting for controlling merkle-tree height 
was back-ported from the work on C* version 4, and that lets you tune some of 
the memory pressure on repairs, trading memory-related performance for 
network-related performance.  Networks are faster these days, it can be a 
reasonable tradeoff to consider. We used to periodically knock over C* nodes 
during repairs, until we incorporated a patch for that issue.

From: Ben G 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, April 16, 2020 at 3:32 AM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra node JVM hang during node repair a table with 
materialized view

Message from External Sender
Thanks a lot. We are working on removing views and controlling the partition size.  
I hope the improvements help us.

Best regards

Gb

Erick Ramirez <erick.rami...@datastax.com> wrote on Thursday, April 16, 2020 at 2:08 PM:
The GC collector is G1.  I repaired the node again after scaling up, and the JVM issue 
reproduced.  Can I increase the heap to 40 GB on a 64GB VM?

I wouldn't recommend going beyond 31GB on G1. It will be diminishing returns as 
I mentioned before.

Do you think the issue is related to materialized view or big partition?

Yes, materialised views are problematic and I don't recommend them for 
production since they're still experimental. But if I were to guess, I'd say 
your problem is more an issue with large partitions and too many tombstones 
both putting pressure on the heap.

The thing is if you can't bootstrap because you're running into the 
TombstoneOverwhelmException (I'm guessing), I can't see how you wouldn't run 
into it with repairs. In any case, try running repairs on the smaller tables 
first and work on the remaining tables one-by-one. But bootstrapping a node 
with repairs is a very expensive exercise than just plain old bootstrap. I get 
that you're in a tough spot right now so good luck!


--

Thanks
Guo Bin


Re: Disabling Swap for Cassandra

2020-04-17 Thread Reid Pinchback
I think there is some potential yak shaving to worrying excessively about swap. 
The reality is that you should know the memory demands of what you are running 
on your C* nodes and have things configured so that significant swap would be a 
highly abnormal situation.  

I'd expect to see excessive churn on buffer cache long before I'd see excessive 
swap kicking in, but sometimes a little swap usage doesn't mean much beyond the 
O/S detecting that some memory allocation is so stale that it may as well push 
it out of the way.  This can happen for perfectly reasonable situations if, for 
example, you make heavy use of crond for automating system maintenance.  Also, 
if you are running on Dell boxes, Dell software updates can get a bit cranky 
and you see resource locking that has zilch to do with your application stack.

I'd worry less about how to crank down swap beyond the advice to make it a last 
resort, and more about how to monitor and alert on abnormal system behavior.  
When it's abnormal, you want a chance to see what is going on so you can fix 
it.  OOM'ing problems out of visibility makes it hard to investigate root 
causes.  I'd rather be paged while the cause is visible, than be paged anyways 
for the down node and have nothing to inspect.

R


On 4/17/20, 6:12 AM, "Alex Ott"  wrote:

 Message from External Sender

I usually recommend the following document:

https://docs.datastax.com/en/dse/5.1/dse-dev/datastax_enterprise/config/configRecommendedSettings.html
- it's about DSE, but applicable to OSS Cassandra as well...

Kunal  at "Thu, 16 Apr 2020 15:49:35 -0700" wrote:
 K> Hello,

 K>  

 K> I need some suggestion from you all. I am new to Cassandra and was 
reading Cassandra best practices. On one document, it was
 K> mentioned that Cassandra should not be using swap, it degrades the 
performance.

 K> My question is instead of disabling swap system wide, can we force 
Cassandra not to use swap? Some documentation suggests to use
 K> memory_locking_policy in cassandra.yaml.

 K> How do I check if our Cassandra already has this parameter and still 
uses swap ? Is there any way i can check this. I already
 K> checked cassandra.yaml and dont see this parameter. Is there any other 
place i can check and confirm?

 K> Also, Can I set memlock parameter to unlimited (64kB default), so 
entire Heap (Xms = Xmx) can be locked at node startup ? Will that
 K> help?

 K> Or if you have any other suggestions, please let me know.

 K>  

 K>  

 K> Regards,

 K> Kunal

 K>  



-- 
With best wishes,
Alex Ott
Principal Architect, DataStax

http://datastax.com/

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org





Re: How quickly off heap memory freed by compacted tables is reclaimed

2020-04-16 Thread Reid Pinchback
If I understand the logic of things like SlabAllocator properly, this is 
essentially buffer space that has been allocated for the purpose and C* pulls 
off ByteBuffer hunks of it as needed.  The notion of reclaiming by the kernel 
wouldn’t apply, C* would be managing the use of the space itself.

Whether GC cycles matter at all isn’t obvious at a quick glance.  C* makes use 
of weak and phantom references so it’s possible that there is a code path where 
release of a ByteBuffer would wait upon a GC, but I can’t say for sure.

From: HImanshu Sharma 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, April 15, 2020 at 10:34 PM
To: "user@cassandra.apache.org" 
Subject: How quickly off heap memory freed by compacted tables is reclaimed

Message from External Sender
Hi

As we know data structures like bloom filters, compression metadata, index 
summary are kept off heap. But once a table gets compacted, how quickly that 
memory is reclaimed by kernel.
Is it instant or it depends when reference if GCed?

Regards
Himanshu


Re: OOM only on one datacenter nodes

2020-04-06 Thread Reid Pinchback
Centos 6.10 is a bit aged as a production server O/S platform, and I recall 
some odd-ball interactions with hardware variations, particularly around 
high-priority memory and network cards.  How good is your O/S level metric 
monitoring?  Not beyond the realm of possibility that your memory issues are 
outside of the JVM.  It isn’t easy to tell you what to specifically look for, 
but I would begin with metrics around memory and swap.  If you don’t see high 
consistent memory use outside of the JVM usage, saves wasting time chasing down 
details that are unlikely to matter.  You need to be used to seeing what those 
metrics are normally like though, so you aren’t chasing phantoms.

I second Jeff’s feedback.  You need the information you need.  It seems 
counterproductive to not configure these nodes to do what you need.  A 
fundamental value of C* is the ability to bring nodes up and down without 
risking availability.  When your existing technology approach is part of why 
you can’t gather the data you need, it helps to give yourself permission to 
improve what you have so you don’t remain in that situation.


From: Surbhi Gupta 
Date: Monday, April 6, 2020 at 12:44 AM
To: "user@cassandra.apache.org" 
Cc: Reid Pinchback 
Subject: Re: OOM only on one datacenter nodes

Message from External Sender
We are using the JRE and not the JDK, hence not able to take a heap dump.

On Sun, 5 Apr 2020 at 19:21, Jeff Jirsa <jji...@gmail.com> wrote:

Set the jvm flags to heap dump on oom

Open up the result in a heap inspector of your preference (like yourkit or 
similar)

Find a view that counts objects by total retained size. Take a screenshot. Send 
that.




On Apr 5, 2020, at 6:51 PM, Surbhi Gupta <surbhi.gupt...@gmail.com> wrote:
I just checked; we have set up the heap size to be 31GB, not 32GB, in DC2.

I checked the CPU and RAM both are same on all the nodes in DC1 and DC2.
What specific parameter I should check on OS ?
We are using CentOS release 6.10.

Currently disk_access_mode is not set, hence it is auto in our env. Will 
setting disk_access_mode to mmap_index_only help?

Thanks
Surbhi

On Sun, 5 Apr 2020 at 01:31, Alex Ott <alex...@gmail.com> wrote:
Have you set -Xmx32g ? In this case you may get significantly less
available memory because of switch to 64-bit references.  See
http://java-performance.info/over-32g-heap-java/ for details, and set
slightly less than 32Gb

Reid Pinchback  at "Sun, 5 Apr 2020 00:50:43 +" wrote:
 RP> Surbi:

 RP> If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That
 RP> still wouldn’t explain DC2 nodes going down, but would at least explain 
them doing more work than might be on your radar right now.

 RP> The hint replay being slow to me sounds like you could be fighting GC.

 RP> You mentioned bumping the DC2 nodes to 32gb.  You might have already been 
doing this, but if not, be sure to be under 32gb, like 31gb.
 RP> Otherwise you’re using larger object pointers and could actually have less 
effective ability to allocate memory.

 RP> As the problem is only happening in DC2, then there has to be a thing that 
is true in DC2 that isn’t true in DC1.  A difference in hardware, a
 RP> difference in O/S version, a difference in networking config or physical 
infrastructure, a difference in client-triggered activity, or a
 RP> difference in how repairs are handled. Somewhere, there is a difference.  
I’d start with focusing on that.

 RP> From: Erick Ramirez <erick.rami...@datastax.com>
 RP> Reply-To: "user@cassandra.apache.org" <user@cassandra.apache.org>
 RP> Date: Saturday, April 4, 2020 at 8:28 PM
 RP> To: "user@cassandra.apache.org" <user@cassandra.apache.org>
 RP> Subject: Re: OOM only on one datacenter nodes

 RP> Message from External Sender

 RP> With a lack of heapdump for you to analyse, my hypothesis is that your DC2 
nodes are taking on traffic (from some client somewhere) but you're
 RP> just not aware of it. The hints replay is just a side-effect of the nodes 
getting overloaded.

 RP> To rule out my hypothesis in the first instance, my recommendation is to 
monitor the incoming connections to the nodes in DC2. If you don't
 RP> have monitoring in place, you could simply run netstat at regular 
intervals and go from there. Cheers!

 RP> GOT QUESTIONS? Apache Cassandra experts from the community and DataStax 
have answers

Re: OOM only on one datacenter nodes

2020-04-04 Thread Reid Pinchback
Surbi:

If you aren’t seeing connection activity in DC2, I’d check to see if the 
operations hitting DC1 are quorum ops instead of local quorum.  That still 
wouldn’t explain DC2 nodes going down, but would at least explain them doing 
more work than might be on your radar right now.

The hint replay being slow to me sounds like you could be fighting GC.

You mentioned bumping the DC2 nodes to 32gb.  You might have already been doing 
this, but if not, be sure to be under 32gb, like 31gb.  Otherwise you’re using 
larger object pointers and could actually have less effective ability to 
allocate memory.

As the problem is only happening in DC2, then there has to be a thing that is 
true in DC2 that isn’t true in DC1.  A difference in hardware, a difference in 
O/S version, a difference in networking config or physical infrastructure, a 
difference in client-triggered activity, or a difference in how repairs are 
handled. Somewhere, there is a difference.  I’d start with focusing on that.


From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Saturday, April 4, 2020 at 8:28 PM
To: "user@cassandra.apache.org" 
Subject: Re: OOM only on one datacenter nodes

Message from External Sender
With a lack of heapdump for you to analyse, my hypothesis is that your DC2 
nodes are taking on traffic (from some client somewhere) but you're just not 
aware of it. The hints replay is just a side-effect of the nodes getting 
overloaded.

To rule out my hypothesis in the first instance, my recommendation is to 
monitor the incoming connections to the nodes in DC2. If you don't have 
monitoring in place, you could simply run netstat at regular intervals and go 
from there. Cheers!

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on 
https://community.datastax.com/.



Re: Minimum System Requirements

2020-03-30 Thread Reid Pinchback
I’ll add a few cautionary notes:

  *   JVM object overhead has memory allocation efficiency issues possible with 
heap >= 32gig, but yes to the added memory for off-heap storage and O/S buffer 
cache.
  *   C* creates a lot of threads, but the number active can sometimes be 
rather small. Depending on your usage pattern you can find a lot of cores idle; 
performance testing for your specifics is important so you don’t blow out a 
budget in the wrong direction. For example, more smaller boxes can work better 
than fewer larger boxes, in some cases.
  *   ‘stable’ has several possible interpretations in C*; does it keep 
running, does it have good throughput, does it have consistent low latency. 
Hardware selection is a variable in a non-linear function also shaped by your 
storage, performance, and fault tolerance expectations.
From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Saturday, March 28, 2020 at 12:03 AM
To: "user@cassandra.apache.org" 
Subject: Re: Minimum System Requirements

Message from External Sender
It really depends on your definition of "stable" but you can run C* on as 
little as a single-core machine with 4-6GB of RAM. It will be stable enough to 
do 1 or 2 queries per second (or maybe a bit more) if you were just doing some 
development and wanted to do minimal testing. There are even examples out there 
where users have clusters deployed on Raspberry Pis as a fun project.

But if you're talking about production deployments, the general recommendation 
is to run it on at least 4 core machines + 16-20GB RAM if you're just testing 
it out and only expect low traffic volumes. Otherwise, it is recommended to 
deploy it on 8-core machines with at least 32-40GB RAM so you have enough 
memory to allocate to the heap. Cheers!

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on 
https://community.datastax.com/.


Re: Performance of Data Types used for Primary keys

2020-03-06 Thread Reid Pinchback
If you care about low-latency reads, I’d worry less about columnar data types, 
and more about the general quality of the data modeling and usage patterns, and 
tuning the things that you see cause latency spikes.  There isn’t just a single 
cause to latency spikes, so expect to spend a couple of months playing 
whack-a-mole as you identify root causes.

What you’re likely going to see most impacting latency variance are GC and I/O 
artifacts.  That’s a quick thing to say, but isolating what specifically to do, 
that’s where the hard work comes in.  Overly-simplistic guesses on what to do, 
I haven’t seen pan out very well. A lot of the tuning knobs in C* can start to 
feel like a kid’s teeter-totter, because making one dynamic better is sometimes 
at the expense of making something else be worse. Quality metric gathering and 
heap examinations will be your friend, and expect to do bursts of per-second 
and sometimes sub-second metric examinations.  I/O in particular, you often 
won’t realize what is going on without a high enough metric frequency to see 
when and how I/O ops are suddenly getting queued up.

Throughput in C* is easier to tune for than latency, and writes are easier to make fast than reads because of how C* is designed. With latency on reads, you're in your worst-case tuning scenario, particularly if you're looking for tight latency at 3 9's.

Don’t forget to see how your numbers stack up during repairs.  That includes 
both nodetool or reaper-managed repairs, but per my comment on usage patterns, 
if you have antipatterns like write-then-read-back going on, under the hood 
you’ll be triggering the equivalent of localized repairs.  All of that adds to 
GC pressure, and hence to latency variance.
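For reference, a minimal sketch of the two alternatives being weighed, with hypothetical keyspace and table names; both key types are hashed by the partitioner, so the surrounding data model and usage patterns matter far more than the column type itself.

-- Alternative 1: text partition key
CREATE TABLE IF NOT EXISTS ks.events_by_text_id (
    id    text,
    ts    timestamp,
    value text,
    PRIMARY KEY (id, ts)
);

-- Alternative 2: bigint partition key (a fixed 8 bytes, but lookup cost is
-- essentially the same because the key is hashed either way)
CREATE TABLE IF NOT EXISTS ks.events_by_numeric_id (
    id    bigint,
    ts    timestamp,
    value text,
    PRIMARY KEY (id, ts)
);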
From: "Hanauer, Arnulf, Vodacom South Africa (External)" 

Reply-To: "user@cassandra.apache.org" 
Date: Friday, March 6, 2020 at 5:15 AM
To: "user@cassandra.apache.org" 
Subject: Performance of Data Types used for Primary keys

Message from External Sender
Hi Cassandra folks,

Is there any difference in performance of general operations if using a TEXT 
based Primary key versus a BIGINT Primary key.

Our use-case requires low latency reads but currently the Primary key is TEXT 
based but the data could work on BIGINT. We are trying to optimise where 
possible.
Any experiences that could point to a winner?


Kind regards
Arnulf Hanauer


Re: Hints replays very slow in one DC

2020-02-27 Thread Reid Pinchback
Our experience with G1GC was that 31gb wasn’t optimal (for us) because while 
you have less frequent full GCs they are bigger when they do happen.  But even 
so, not to the point of a 9.5s full collection.

Unless it is a rare event associated with something weird happening outside of 
the JVM (there are some whacky interactions between memory and dirty page 
writing that could cause it, but not typically), then that is evidence of a 
really tough fight to reclaim memory.  There are a lot of things that can 
impact garbage collection performance.  Something is either being pushed very 
hard, or something is being constrained very tightly compared to resource 
demand.

I’m with Erick, I wouldn’t be putting my attention right now on anything but 
the GC issue. Everything else that happens within the JVM envelope is going to 
be a misread on timing until you have stable garbage collection.  You might 
have other issues later, but you aren’t going to know what those are yet.

One thing you could at least try to eliminate quickly as a factor.  Are repairs 
running at the time that things are slow?  In prior to 3.11.5 you lack one of 
the tuning knobs for doing a tradeoff on memory vs network bandwidth when doing 
repairs.

I’d also make sure you have tuned C* to migrate whatever you reasonably can to 
be off-heap.

Another thought for surprise demands on memory.  I don’t know if this is in 
3.11.0, you’ll have to check the C* bash scripts for launching the service.  
The number of malloc arenas haven’t always been curtailed, and that could 
result in an explosion in memory demand.  I just don’t recall where in C* 
version history that was addressed.


From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, February 26, 2020 at 9:55 PM
To: "user@cassandra.apache.org" 
Subject: Re: Hints replays very slow in one DC

Message from External Sender
Nodes are going down due to Out of Memory and we are using a 31GB heap size in DC1; however, DC2 (which serves the traffic) has a 16GB heap.
The reason we had to increase the heap in DC1 is that DC1 nodes were going down due to an Out of Memory issue, but DC2 nodes never went down.

It doesn't sound right that the primary DC is DC2 but DC1 is under load. You 
might not be aware of it but the symptom suggests DC1 is getting hit with lots 
of traffic. If you run netstat (or whatever utility/tool of your choice), you 
should see established connections to the cluster. That should give you clues 
as to where it's coming from.

We also noticed below kind of messages in system.log
FailureDetector.java:288 - Not marking nodes down due to local pause of 
9532654114 > 50

That's another smoking gun that the nodes are buried in GC. A 9.5-second pause 
is significant. The slow hinted handoffs is really the least of your problem 
right now. If nodes weren't going down, there wouldn't be hints to handoff in 
the first place. Cheers!

GOT QUESTIONS? Apache Cassandra experts from the community and DataStax have 
answers! Share your expertise on 
https://community.datastax.com/.


Re: Mechanism to Bulk Export from Cassandra on daily Basis

2020-02-19 Thread Reid Pinchback
To the question of ‘best approach’, so far the comments have been about 
alternatives in tools.

Another axis you might want to consider is from the data model viewpoint.  So, 
for example, let’s say you have 600M rows.  You want to do a daily transfer of 
data for some reason.  First question that comes to mind is, do you need all 
the data every day?  Usually that would only be the case if all of the data is 
at risk of changing.

Generally the way I’d cut down the pain on something like this is to figure out 
if the data model currently does, or could be made to, only mutate in a limited 
subset.  Then maybe all you are transferring are the daily changes.  Systems 
based on catching up to daily changes will usually be pulling single-digit 
percentages of data volume compared to the entire storage footprint.  That’s 
not only a lot less data to pull, it’s also a lot less impact on the ongoing 
operations of the cluster while you are pulling that data.

R

From: "JOHN, BIBIN" 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, February 19, 2020 at 1:13 PM
To: "user@cassandra.apache.org" 
Subject: Mechanism to Bulk Export from Cassandra on daily Basis

Message from External Sender
Team,
We have a requirement to bulk export data from Cassandra on daily basis? Table 
contain close to 600M records and cluster is having 12 nodes. What is the best 
approach to do this?


Thanks
Bibin John


Re: AWS I3.XLARGE retiring instances advices

2020-02-16 Thread Reid Pinchback
No actually in this case I didn’t really have an opinion because C* is an 
architecturally different beast than an RDBMS.  That’s kinda what ticked the 
curiosity when you made the suggestion about co-locating commit and data.  It 
raises an interesting question for me.  As for the 10 seconds delay, I’m used 
to looking at graphite, so bad is relative. 

The question that pops to mind is this. If a commit log isn’t really an 
important recovery mechanism…. should one even be part of C* at all?  It’s a 
lot of code complexity and I/O volume and O/S tuning complexity to worry about 
having good I/O resiliency and performance with both commit and data volumes.

If the proper way to deal with all data volume problems in C* would be to burn 
the node (or at least, it’s state) and rebuild via the state of its neighbours, 
then repairs (whether administratively triggered, or as a side-effect of 
ongoing operations) should always catch up with any mutations anyways so long 
as the data is appropriately replicated.  The benefit to having a commit
log would seem limited to data which isn’t replicated.

However, I shouldn’t derail Sergio’s thread.  It just was something that caught 
my interest and got me mulling, but it’s a tangent.

From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, February 14, 2020 at 9:04 PM
To: "user@cassandra.apache.org" 
Subject: Re: AWS I3.XLARGE retiring instances advices

Message from External Sender
Erick, a question purely as a point of curiosity.  The entire model of a commit 
log, historically (speaking in RDBMS terms), depended on a notion of stable
store. The idea being that if your data volume lost recent writes, the failure 
mode there would be independent of writes to the volume holding the commit log, 
so that replay of the commit log could generally be depended on to recover the 
missing data.  I’d be curious what the C* expert viewpoint on that would be, 
with the commit log and data on the same volume.

Those are fair points so thanks for bringing them up. I'll comment from a 
personal viewpoint and others can provide their opinions/feedback.

If you think about it, you've lost the data volume -- not just the recent 
writes. Replaying the mutations in the commit log is probably insignificant 
compared to having to recover the data through various ways (re-bootstrap, 
refresh from off-volume/off-server snapshots, etc). The data and redo/archive 
logs being on the same volume (in my opinion) is more relevant in RDBMS since 
they're mostly deployed on SANs compared to the nothing-shared architecture of 
C*. I know that's debatable and others will have their own view. :)

How about you, Reid? Do you have concerns about both data and commitlog being 
on the same disk? And slightly off-topic but by extension, do you also have 
concerns about the default commitlog fsync() being 10 seconds? Cheers!


Re: AWS I3.XLARGE retiring instances advices

2020-02-14 Thread Reid Pinchback
I was curious and did some digging.  400k is the max read IOPs on the 1-device 
instance types, 3M IOPS is for the 8-device instance types.

From: Reid Pinchback 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, February 14, 2020 at 11:24 AM
To: "user@cassandra.apache.org" 
Subject: Re: AWS I3.XLARGE retiring instances advices

Message from External Sender
I’ve seen claims of 3M IOPS on reads for AWS, not sure about writes.  I think 
you just need a recent enough kernel to not get in the way of doing multiqueue 
operations against the NVMe device.

Erick, a question purely as a point of curiosity.  The entire model of a commit 
log, historically (speaking in RDBMS terms), depended on a notion of stable
store.  The idea being that if your data volume lost recent writes, the failure 
mode there would be independent of writes to the volume holding the commit log, 
so that replay of the commit log could generally be depended on to recover the 
missing data.  I’d be curious what the C* expert viewpoint on that would be, 
with the commit log and data on the same volume.

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 13, 2020 at 10:46 PM
To: "user@cassandra.apache.org" 
Subject: Re: AWS I3.XLARGE retiring instances advices

A little off-topic but personally I would co-locate the commitlog on the same 
950GB NVMe SSD as the data files. You would get a much better write performance 
from the nodes compared to EBS and they shouldn't hurt your reads since the 
NVMe disks have very high IOPS. I think they can sustain 400K+ IOPS (don't 
quote me). I'm sure others will comment if they have a different experience. 
And of course, YMMV. Cheers!


Re: AWS I3.XLARGE retiring instances advices

2020-02-14 Thread Reid Pinchback
I’ve seen claims of 3M IOPS on reads for AWS, not sure about writes.  I think 
you just need a recent enough kernel to not get in the way of doing multiqueue 
operations against the NVMe device.

Erick, a question purely as a point of curiosity.  The entire model of a commit 
log, historically (speaking in RDBMS terms), depended on a notion of stable
store.  The idea being that if your data volume lost recent writes, the failure 
mode there would be independent of writes to the volume holding the commit log, 
so that replay of the commit log could generally be depended on to recover the 
missing data.  I’d be curious what the C* expert viewpoint on that would be, 
with the commit log and data on the same volume.

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 13, 2020 at 10:46 PM
To: "user@cassandra.apache.org" 
Subject: Re: AWS I3.XLARGE retiring instances advices

A little off-topic but personally I would co-locate the commitlog on the same 
950GB NVMe SSD as the data files. You would get a much better write performance 
from the nodes compared to EBS and they shouldn't hurt your reads since the 
NVMe disks have very high IOPS. I think they can sustain 400K+ IOPS (don't 
quote me). I'm sure others will comment if they have a different experience. 
And of course, YMMV. Cheers!


Re: Connection reset by peer

2020-02-13 Thread Reid Pinchback
Since ping is ICMP, not TCP, you probably want to investigate a mix of TCP and 
CPU stats to see what is behind the slow pings. I’d guess you are getting 
network impacts beyond what the ping times are hinting at.  ICMP isn’t subject 
to retransmission, so your TCP situation could be far worse than ping latencies 
may suggest.

From: "Hanauer, Arnulf, Vodacom South Africa (External)" 

Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 13, 2020 at 2:06 AM
To: "user@cassandra.apache.org" 
Subject: RE: Connection reset by peer

Message from External Sender

Thanks to both Erik/Shaun for your responses,

Both your explanations are plausible in my scenario, this is what I have done 
subsequently which seems to have improved the situation,



  1.  The cluster was very busy trying to run repairs/sync the new replicas (about 350GB) in the new DC (Gossip was temporarily marking down the source nodes at different points in time)

  *   Disabled Reaper, stopped all validation/repairs

  2.  I removed the new replicas to stop any potential read_repair across the WAN

  *   I will recreate the replicas over the weekend during quiet time & run the repair to sync

  3.  The network ping response time was quite high, around 10-15 msec, at error times

  *   This dropped to under 1 ms later in the day when some jobs were rerun successfully

  4.  I will apply some of the recommended TCP_KEEPALIVE settings Shaun pointed me to



Last question: In all your experiences, how high can the latency (simple ping response times) go before it becomes a problem? (Obviously the lower the better, but is there some sort of cut-off/formula beyond which problems like the connection resets can be expected intermittently?)




Kind regards

Arnulf Hanauer



From: Erick Ramirez 
Sent: Thursday, 13 February 2020 03:10
To: user@cassandra.apache.org
Subject: Re: Connection reset by peer

I generally see these exceptions when the cluster is overloaded. I think what's 
happening is that when the app/driver sends a read request, the coordinator 
takes a long time to respond because the nodes are busy serving other requests. 
The driver gives up (client-side timeout reached) and the socket is closed. 
Meanwhile, the coordinator eventually gets results from replicas and tries to 
send the response back to the app/driver but can't because the connection is no 
longer there. Does this scenario sound plausible for your cluster?


Erick Ramirez  |  Developer Relations

erick.rami...@datastax.com | 
datastax.com


On Wed, 12 Feb 2020 at 21:13, Hanauer, Arnulf, Vodacom South Africa (External) 
mailto:arnulf.hana...@vcontractor.co.za>> 
wrote:
Hi Cassandra folks,

We are getting a lot of these errors and transactions are timing out and I was 
wondering if this can be caused by Cassandra itself or if this is a genuine 
Linux network issue only. The client job reports Cassandra node down after this 
occurs but I suspect this is 

Re: [EXTERNAL] Cassandra 3.11.X upgrades

2020-02-12 Thread Reid Pinchback
Hi Sergio,

We have a production cluster with vnodes=4 that is a bit larger than that, so 
yes it is possible to do so.  That said, we aren’t wedded to vnodes=4 and are 
paying attention to discussions happening around the 4.0 work and mulling the 
possibility of shifting to 16.

Note though, we didn’t just pick 4 based on blind faith.  There was work done 
to evaluate the state of our token distribution before and after.  It’s pretty 
much true of all the settings we have for C* (or anything else for that 
matter).  We started from a lot of helpful articles and docs and mail threads 
and JIRA issues, but then we evaluated to see what we thought of the results.  
It was the only way for us to start building up some understanding of the 
details.

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, February 12, 2020 at 2:50 PM
To: "user@cassandra.apache.org" 
Subject: Re: [EXTERNAL] Cassandra 3.11.X upgrades

Message from External Sender
Thanks, everyone! @Jon 
https://lists.apache.org/thread.html/rd18814bfba487824ca95a58191f4dcdb86f15c9bb66cf2bcc29ddf0b%40%3Cuser.cassandra.apache.org%3E
I have a side response to something that looks to be controversial, given the response from Anthony.
So, is it safe to go to production with a 1TB cluster and vnodes = 4?
Do we need to follow these steps?
https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html
From Anthony's response, what I took away is that this is just an example and that vnodes = 4 is not ready for production.
https://lists.apache.org/thread.html/r21cd99fa269076d186a82a8b466eb925681373302dd7aa6bb26e5bde%40%3Cuser.cassandra.apache.org%3E

Best,

Sergio


On Wed, Feb 12, 2020 at 11:42 AM Durity, Sean R <sean_r_dur...@homedepot.com> wrote:
>>A while ago, on my first cluster

Understatement used so effectively. Jon is a master.



On Wed, Feb 12, 2020 at 11:02 AM Sergio 
mailto:lapostadiser...@gmail.com>> wrote:
Thanks for your reply!

So unless the sstable format has not been changed I can avoid to do that.

Correct?

Best,

Sergio

On Wed, Feb 12, 2020, 10:58 AM Durity, Sean R 
mailto:sean_r_dur...@homedepot.com>> wrote:
Check the readme.txt for any upgrade notes, but the basic procedure is to:

  *   Verify that nodetool upgradesstables has completed successfully on all 
nodes from any previous upgrade
  *   Turn off repairs and any other streaming operations (add/remove nodes)
  *   Stop an un-upgraded node (seeds first, preferably)
  *   Install new binaries and configs on the down node
  *   Restart that node and make sure it comes up clean (it will function 
normally in the cluster – even with mixed versions)
  *   Repeat for all nodes
  *   Run upgradesstables on each node (as many at a time as your load will 
allow). Minor upgrades usually don’t require this step (only if the sstable 
format has changed), but it is good to check.
  *   NOTE: in most cases applications can keep running and will not notice 
much impact – unless the cluster is overloaded and a single node down causes 
impact.



Sean Durity – Staff Systems Engineer, Cassandra

From: Sergio mailto:lapostadiser...@gmail.com>>
Sent: Wednesday, February 12, 2020 11:36 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Cassandra 3.11.X upgrades

Hi guys!

How do you usually upgrade your cluster for minor version upgrades?

I tried to add a node with 3.11.5 version to a test cluster with 3.11.4 nodes.

Is there any restriction?

Best,

Sergio




Re: Overload because of hint pressure + MVs

2020-02-11 Thread Reid Pinchback
A caveat to the 31GB recommendation for G1GC: if you have tight latency SLAs instead of throughput SLAs, then this doesn't necessarily pan out to be beneficial.

Yes, the GCs are less frequent, but they can hurt more when they do happen. The win is if your usage pattern is such that the added time between collections lets objects die before being copied into old gen, where a smaller heap with more frequent GC cycles would have decided it had to do promotions. C* tends to have a lot of medium-lifetime objects on the heap, so it can really come down to the specifics of what your clients are typically doing.

Also, reallocation of RAM from O/S buffer cache to Java heap will also change 
the dynamics of dirty page flushes from your writes, which again directly 
surfaces in C* read latency numbers during I/O stalls from the write spikes in 
the background.  So really bumping up heap is an alteration that can be a 
double-whammy for the latency sensitive.  Those only caring about throughput 
won’t care and it’s probably unconditionally a win to go to 31GB.

R

From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, February 10, 2020 at 3:55 PM
To: "user@cassandra.apache.org" 
Subject: Re: Overload because of hint pressure + MVs

Message from External Sender
Currently the value of phi_convict_threshold is not set, which leaves it at 8 (the default).
Can this also cause hints buildup even when we can see that all nodes are UP?

You can bump it up to 12 to reduce the sensitivity but it's likely GC pauses 
causing it. Phi convict is the side-effect, not the cause.

Just to add , we are using 24GB heap size.

Are you using CMS? If using G1, I'd recommend bumping it up to 31GB if the 
servers have 40+ GB of RAM. Cheers!


Re: sstableloader: How much does it actually need?

2020-02-07 Thread Reid Pinchback
Just mulling this based on some code and log digging I was doing while trying 
to have Reaper stay on top of our cluster.

I think maybe the caveat here relates to eventual consistency.  C* doesn’t do 
state changes as distributed transactions.  The assumption here is that RF=3 is 
implying that at any given instant in real time, either the data is visible 
nowhere, or it is visible in 3 places.  That’s a conceptual simplification but 
not a real time invariant when you don’t have a transactional horizon to 
perfectly determine visibility of data.

When you have C* usage antipatterns like a client that is determined to read 
back data that it just wrote, as though there was a session context that 
somehow provided repeatable read guarantees, under the covers in the logs you 
can see C* fighting to do on-the-fly repairs to push through the requested 
level of consistency before responding to the query.  Which means, for some 
period of time, that achieving consistency was still work in flight.

I’ve also read about some boundary screw cases like drift in time resolution 
between servers creating the opportunity for stale data, and repairs I think 
would fix that. I haven’t tested the scenario though, so I’m not sure how real 
the situation is.

Bottom line though, minus repairs, I think having all the nodes is getting you 
all your chances to repair the problems.  And if the data is mutating as you 
are grabbing it, the entire frontier of changes is ‘minus repairs’.  Since 
tokens are distributed somewhat randomly, you don’t know where you need to make 
up the differences after.

That’s about as far as my navel gazing goes on that.

From: manish khandelwal 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, February 7, 2020 at 12:22 AM
To: "user@cassandra.apache.org" 
Subject: Re: sstableloader: How much does it actually need?

Message from External Sender
Yes you will have all the data in two nodes provided there is no mutation drop 
at node level or data is repaired

For example if you data A,B,C and D. with RF=3 and 4 nodes (node1, node2, node3 
and node4)

Data A is in node1, node2 and node3
Data B is in node2, node3, and node4
Data C is in node3, node4 and node1
Data D is in node4, node1 and node2

With this configuration, any two nodes combined will give all the data.


Regards
Manish

On Fri, Feb 7, 2020 at 12:53 AM Voytek Jarnot 
mailto:voytek.jar...@gmail.com>> wrote:
Been thinking about it, and I can't really see how with 4 nodes and RF=3, any 2 
nodes would *not* have all the data; but am more than willing to learn.

On the other thing: that's an attractive option, but in our case, the target 
cluster will likely come into use before the source-cluster data is available 
to load. Seemed to me the safest approach was sstableloader.

Thanks

On Wed, Feb 5, 2020 at 6:56 PM Erick Ramirez 
mailto:flightc...@gmail.com>> wrote:
Unfortunately, there isn't a guarantee that 2 nodes alone will have the full 
copy of data. I'd rather not say "it depends". 

TIP: If the nodes in the target cluster have identical tokens allocated, you 
can just do a straight copy of the sstables node-for-node then do nodetool 
refresh. If the target cluster is already built and you can't assign the same 
tokens then sstableloader is your only option. Cheers!

P.S. No need to apologise for asking questions. That's what we're all here for. 
Just keep them coming. 


Re: Query timeouts after Cassandra Migration

2020-02-07 Thread Reid Pinchback
Ankit, are the instance types identical in the new cluster, with I/O 
configuration identical at the system level, and are the Java settings for C* 
identical between the two clusters?  With radical timing differences happening 
periodically, the two things I’d have on my radar would be garbage collections 
and problems in flushing dirty pages.  Even if neither of those are the issue, 
one way or another, timeouts make me hunt for the resource everybody is queued 
up on.

From: Erick Ramirez 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 6, 2020 at 10:08 PM
To: "user@cassandra.apache.org" 
Subject: Re: Query timeouts after Cassandra Migration

Message from External Sender
So do you advise copying tokens in such cases ? What procedure is advisable ?

Specifically for your case with 3 nodes + RF=3, it won't make a difference so 
leave it as it is.

Latency increased on target cluster.

Have you tried to run a trace of the queries which are slow? It will help you 
identify where the slowness is coming from. Cheers!
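If it helps, tracing can be toggled per-session in cqlsh; the query below is just a hypothetical placeholder for one of the slow ones.

-- Enable request tracing for this cqlsh session
TRACING ON;

-- Run one of the slow queries; cqlsh prints a step-by-step trace with the
-- elapsed time on the coordinator and each replica involved
SELECT * FROM my_keyspace.my_table WHERE id = 42;

-- Turn tracing off again when done
TRACING OFF;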


Re: [EXTERNAL] Re: Running select against cassandra

2020-02-06 Thread Reid Pinchback
I defer to Sean’s comment on materialized views.  I’m more familiar with 
DynamoDB on that front, where you do this pretty routinely.  I was curious so I 
went looking. This appears to be the C* Jira that points to many of the problem 
points:

https://issues.apache.org/jira/browse/CASSANDRA-13826

Abdul, you’d probably want to refer to that or similar info.  Could be that the 
more practical resolution is to just have the client write the data twice, if 
there are two very different query patterns to support.  Writes usually have 
quite low latency in C*, so double-writing may be less of a performance hit, 
and later drag on memory on I/O, than a query model that makes you browse 
through more data than necessary.
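A minimal sketch of what "write the data twice" can look like, using hypothetical keyspace, table, and column names: one table per query pattern, and the client writes both copies itself. A logged batch keeps the two copies in step at the cost of a little extra coordinator work.

-- One table per query pattern, same payload, different keys
CREATE TABLE IF NOT EXISTS app.sessions_by_user (
    userid      bigint,
    session_usr text,
    login_time  timestamp,
    PRIMARY KEY (userid, session_usr)
);

CREATE TABLE IF NOT EXISTS app.sessions_by_login_time (
    login_date  date,
    login_time  timestamp,
    userid      bigint,
    session_usr text,
    PRIMARY KEY (login_date, login_time, userid)
);

-- The client writes both copies instead of relying on a materialized view
BEGIN BATCH
    INSERT INTO app.sessions_by_user (userid, session_usr, login_time)
    VALUES (42, 'abc123', '2020-02-06 19:00:00+0000');
    INSERT INTO app.sessions_by_login_time (login_date, login_time, userid, session_usr)
    VALUES ('2020-02-06', '2020-02-06 19:00:00+0000', 42, 'abc123');
APPLY BATCH;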

From: "Durity, Sean R" 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 6, 2020 at 4:24 PM
To: "user@cassandra.apache.org" 
Subject: RE: [EXTERNAL] Re: Running select against cassandra

Message from External Sender
Reid is right. You build the tables to easily answer the queries you want. So, 
start with the query! I inferred a query for you based on what you mentioned. 
If my inference is wrong, the table structure is likely wrong, too.

So, what kind of query do you want to run?

(NOTE: a select count(*) that is not restricted to within a single partition is 
a very bad option. Don’t do that)

The query for my table below is simply:
select user_count [, other columns] from users_by_day where date = ? and hour = 
? and minute = ?


Sean Durity

From: Reid Pinchback 
Sent: Thursday, February 6, 2020 4:10 PM
To: user@cassandra.apache.org
Subject: Re: [EXTERNAL] Re: Running select against cassandra

Abdul,

When in doubt, have a query model that immediately feeds you exactly what you 
are looking for. That’s kind of the data model philosophy that you want to 
shoot for as much as feasible with C*.

The point of Sean’s table isn’t the similarity to yours, it is how he has it 
keyed because it suits a partition structure much better aligned with what you 
want to request.  So I’d say yes, if a materialized view is how you want to 
achieve a denormalized state where the query model directly supports giving you 
what you want to query for, that sounds like an appropriate option to consider.
 You might want a composite partition key for having an efficient selection of 
narrow time ranges.

From: Abdul Patel mailto:abd786...@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Date: Thursday, February 6, 2020 at 2:42 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: Re: [EXTERNAL] Re: Running select against cassandra

Message from External Sender
This is a schema similar to what we have; they want to get the user-connected (concurrent) count every, say, 1-5 minutes.
I am thinking: will a simple select have performance issues, or should we go for materialized views?

CREATE TABLE  usr_session (
userid bigint,
session_usr text,
last_access_time timestamp,
login_time timestamp,
status int,
PRIMARY KEY (userid, session_usr)
) WITH CLUSTERING ORDER BY (session_usr ASC)


On Thu, Feb 6, 2020 at 2:09 PM Durity, Sean R 
mailto:sean_r_dur...@homedepot.com>> wrote:
Do you only need the current count or do you want to keep the historical counts 
also? By active users, does that mean some kind of user that the application 
tracks (as opposed to the Cassandra user connected to the cluster)?

I would consider a table like this for tracking active users through time:

CREATE TABLE users_by_day (
    app_date date,
    hour int,
    minute int,
    user_count int,
    longest_login_user text,
    longest_login_seconds int,
    last_login timestamp,
    last_login_user text,
    PRIMARY KEY (app_date, hour, minute)
);

Then, your reporting can easily select full days or a specific, one-minute 
slice. Of course, the app would need to have a timer and write out the data. I 
would also suggest a TTL on the data so that you only keep what you need (a 
week, a year, whatever). Of course, if your reporting requires different 
granularities, you could consider a different time bucket for the table (by 
hour, by week, etc.)


Sean Durity – Staff Systems Engineer, Cassandra

From: Abdul Patel mailto:abd786...@gmail.com>>
Sent: Thursday, February 6, 2020 1:54 PM
To: user@cassandra.apache.org<mailto:user@cassandra.apache.org>
Subject: [EXTERNAL] Re: Running select against cassandra

It's sort of user-connected; the app team needs the number of active users connected, say, every 1 to 5 mins.
The timeout at app end is 120ms.



On Thursday, February 6, 2020, Michael Shuler 
mailto:mich...@pbandjelly.org>> wrote:
You'll have to be more specific. What is your table schema and what is the 
SELECT query? What is the normal response time?

As a basic guide for your general question, if

Re: [EXTERNAL] Re: Running select against cassandra

2020-02-06 Thread Reid Pinchback
Abdul,

When in doubt, have a query model that immediately feeds you exactly what you 
are looking for. That’s kind of the data model philosophy that you want to 
shoot for as much as feasible with C*.

The point of Sean’s table isn’t the similarity to yours, it is how he has it 
keyed because it suits a partition structure much better aligned with what you 
want to request.  So I’d say yes, if a materialized view is how you want to 
achieve a denormalized state where the query model directly supports giving you 
what you want to query for, that sounds like an appropriate option to consider.
 You might want a composite partition key for having an efficient selection of 
narrow time ranges.
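A minimal sketch of that composite-partition-key idea applied to a Sean-style counts table; the names are assumed for illustration. Bucketing the partition by day and hour keeps partitions small while still allowing a narrow time-range read, and a table-level TTL keeps only as much history as needed.

CREATE TABLE IF NOT EXISTS app.users_by_day (
    app_date   date,
    hour       int,
    minute     int,
    user_count int,
    PRIMARY KEY ((app_date, hour), minute)
) WITH default_time_to_live = 604800;   -- keep one week of samples

-- Written by the app on a timer, read back for a narrow slice
INSERT INTO app.users_by_day (app_date, hour, minute, user_count)
VALUES ('2020-02-06', 14, 42, 1337);

SELECT minute, user_count
FROM app.users_by_day
WHERE app_date = '2020-02-06' AND hour = 14;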

From: Abdul Patel 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, February 6, 2020 at 2:42 PM
To: "user@cassandra.apache.org" 
Subject: Re: [EXTERNAL] Re: Running select against cassandra

Message from External Sender
This is a schema similar to what we have; they want to get the user-connected (concurrent) count every, say, 1-5 minutes.
I am thinking: will a simple select have performance issues, or should we go for materialized views?

CREATE TABLE  usr_session (
userid bigint,
session_usr text,
last_access_time timestamp,
login_time timestamp,
status int,
PRIMARY KEY (userid, session_usr)
) WITH CLUSTERING ORDER BY (session_usr ASC)


On Thu, Feb 6, 2020 at 2:09 PM Durity, Sean R 
mailto:sean_r_dur...@homedepot.com>> wrote:
Do you only need the current count or do you want to keep the historical counts 
also? By active users, does that mean some kind of user that the application 
tracks (as opposed to the Cassandra user connected to the cluster)?

I would consider a table like this for tracking active users through time:

CREATE TABLE users_by_day (
    app_date date,
    hour int,
    minute int,
    user_count int,
    longest_login_user text,
    longest_login_seconds int,
    last_login timestamp,
    last_login_user text,
    PRIMARY KEY (app_date, hour, minute)
);

Then, your reporting can easily select full days or a specific, one-minute 
slice. Of course, the app would need to have a timer and write out the data. I 
would also suggest a TTL on the data so that you only keep what you need (a 
week, a year, whatever). Of course, if your reporting requires different 
granularities, you could consider a different time bucket for the table (by 
hour, by week, etc.)


Sean Durity – Staff Systems Engineer, Cassandra

From: Abdul Patel mailto:abd786...@gmail.com>>
Sent: Thursday, February 6, 2020 1:54 PM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: Running select against cassandra

It's sort of user-connected; the app team needs the number of active users connected, say, every 1 to 5 mins.
The timeout at app end is 120ms.



On Thursday, February 6, 2020, Michael Shuler 
mailto:mich...@pbandjelly.org>> wrote:
You'll have to be more specific. What is your table schema and what is the 
SELECT query? What is the normal response time?

As a basic guide for your general question, if the query is something sort of 
irrelevant that should be stored some other way, like a total row count, or 
most any SELECT that requires ALLOW FILTERING, you're doing it wrong and should 
re-evaluate your data model.

1 query per minute is a minuscule fraction of the basic capacity of queries per 
minute that a Cassandra cluster should be able to handle with good data 
modeling and table-relevant query. All depends on the data model and query.

Michael

On 2/6/20 12:20 PM, Abdul Patel wrote:
Hi,

Is it advisable to run a select query every minute to grab data from Cassandra for reporting purposes? If not, what's the alternative?

-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 
user-h...@cassandra.apache.org




Re: Cassandra OS Patching.

2020-02-04 Thread Reid Pinchback
Another thing I'll add, since I don't think any of the other responders brought 
it up.

This all assumes that you already believe that the update is safe.  If you have 
any kind of test cluster, I'd evaluate the change there first.  

While I haven't hit it with C* specifically, I have seen database problems from 
the O/S related to shared library updates.  With database technologies mostly 
formed from HA pairs your blast radius on data corruption can be limited, so by 
the time you realize the problem you haven't risked more than you intended. C* 
is a bit more biased to propagate a lot of data across a common infrastructure. 
 Just keep that in mind when reviewing what the changes are in the O/S updates. 
 When the updates are related to external utilities and services, but not to 
shared libraries, usually your only degradation risks relate to performance and 
availability so you are in a safer position to forge ahead and then it only 
comes down to proper C* hygiene. Performance and connectivity risks can be 
mitigated by having one DC you update first, and then letting it stew awhile as
you evaluate the results. Plan in advance what you want to evaluate before 
continuing on.

R

On 1/30/20, 11:09 AM, "Michael Shuler"  wrote:

 Message from External Sender

That is some good info. To add just a little more, knowing what the 
pending security updates are for your nodes helps in knowing what to do 
after. Read the security update notes from your vendor.

Java or Cassandra update? Of course the service needs restarted - 
rolling upgrade and restart the `cassandra` service as usual.

Linux kernel update? Node needs a full reboot, so follow a rolling 
reboot plan.

Other OS updates? Most can be done while not affecting Cassandra. For 
instance, an OpenSSH security update to patch some vulnerability should 
most certainly be done as soon as possible, and the node updates can be 
even be in parallel without causing any problems with the JVM or 
Cassandra service. Most intelligent package update systems will install 
the update and restart the affected service, in this hypothetical case 
`sshd`.

Michael

On 1/30/20 3:56 AM, Erick Ramirez wrote:
> There is no need to shutdown the application because you should be able 
> to carry out the operating system upgraded without an outage to the 
> database particularly since you have a lot of nodes in your cluster.
> 
> Provided your cluster has sufficient capacity, you might even have the 
> ability to upgrade multiple nodes in parallel to reduce the upgrade 
> window. If you decide to do nodes in parallel and you fully understand 
> the token allocations and where the nodes are positioned in the ring in 
> each DC, make sure you only upgrade nodes which are at least 5 nodes 
> "away" to the right so you know none of the nodes would have overlapping 
> token ranges and they're not replicas of each other.
> 
> Other points to consider are:
> 
>   * If a node goes down (for whatever reason), I suggest you upgrade the
> OS on the node before bringing back up. It's already down so you
> might as well take advantage of it since you have so many nodes to
> upgrade.
>   * Resist the urge to run nodetool decommission or nodetool removenode
> if you encounter an issue while upgrading a node. This is a common
> knee-jerk reaction which can prove costly because the cluster will
> rebalance automatically, adding more time to your upgrade window.
> Either fix the problem on the server or replace node using the
> "replace_address" flag.
>   * Test, test, and test again. Familiarity with the process is your
> friend when the unexpected happens.
>   * Plan ahead and rehearse your recovery method (i.e. replace the node)
> should you run into unexpected issues.
>   * Stick to the plan and be prepared to implement it -- don't deviate.
> Don't spend 4 hours or more investigating why a server won't start.
>   * Be decisive. Activate your recovery/remediation plan immediately.
> 
> I'm sure others will chime in with their recommendations. Let us know 
> how you go as I'm sure others would be interested in hearing from your 
> experience. Not a lot of shops have a deployment as large as yours so 
> you are in an enviable position. Good luck!
> 
> On Thu, Jan 30, 2020 at 3:45 PM Anshu Vajpayee wrote:
> 
> Hi Team,
> What is the best way to patch OS of 1000 nodes Multi DC Cassandra
> cluster where we cannot suspend application traffic( we can redirect
> traffic to one DC).
> 
> Please suggest if anyone has any best practice around it.
> 
> -- 
> Cheers,
> Anshu V

Re: Cassandra going OOM due to tombstones (heapdump screenshots provided)

2020-01-24 Thread Reid Pinchback
Just a thought along those lines.  If the memtable flush isn’t keeping up, you 
might find that manifested in the I/O queue length and dirty page stats leading 
into the time the OOM event took place.  If you do see that, then you might 
need to do some I/O tuning as well.

From: Jeff Jirsa 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, January 24, 2020 at 12:09 PM
To: cassandra 
Subject: Re: Cassandra going OOM due to tombstones (heapdump screenshots 
provided)

Message from External Sender
Ah, I focused too much on the literal meaning of startup. If it's happening 
JUST AFTER startup, it's probably getting flooded with hints from the other 
hosts when it comes online.

If that's the case, it may be just simply overrunning the memtable, or it may 
be a deadlock like 
https://issues.apache.org/jira/browse/CASSANDRA-15367
 (which benedict just updated this morning, good timing)


If it's after the host comes online and it's hint replay from the other hosts, 
you probably want to throttle hint replay significantly on the rest of the 
cluster. Whatever your hinted handoff throttle is, consider dropping it by 
50-90% to work around whichever of those two problems it is.


On Fri, Jan 24, 2020 at 9:06 AM Jeff Jirsa 
mailto:jji...@gmail.com>> wrote:
6 GB of mutations on heap
Startup would replay commitlog, which would re-materialize all of those 
mutations and put them into the memtable. The memtable would flush over time to 
disk, and clear the commitlog.

It looks like PERHAPS the commitlog replay is faster than the memtable flush, 
so you're blowing out the memtable while you're replaying the commitlog.

How much memory does the machine have? How much of that is allocated to the 
heap? What are your memtable settings? Do you see log lines about flushing 
memtables to free room (probably something like the slab pool cleaner)?



On Fri, Jan 24, 2020 at 3:16 AM Behroz Sikander 
mailto:bsikan...@apache.org>> wrote:
We recently had a lot of OOM in C* and it was generally happening during 
startup.
We took some heap dumps but still cannot pinpoint the exact reason, so we need some help from the experts.

Our clients are not explicitly deleting data but they have TTL enabled.

C* details:
> show version
[cqlsh 5.0.1 | Cassandra 2.2.9 | CQL spec 3.3.1 | Native protocol v4]

Most of the heap was allocated to object[] arrays of
- org.apache.cassandra.db.Cell

Heap dump images:
Heap usage by class: 
https://pasteboard.co/IRrfu70.png
Classes using most heap: 
https://pasteboard.co/IRrgszZ.png
Overall heap usage: 
https://pasteboard.co/IRrg7t1.png

What could be the reason for such OOM? Something that we can tune to improve 
this?
Any help would be much appreciated.


-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 
user-h...@cassandra.apache.org


Re: sstableloader & num_tokens change

2020-01-24 Thread Reid Pinchback
Jon Haddad has previously made the case for num_tokens=4.  His Accelerate 2019 
talk is available at:

https://www.youtube.com/watch?v=swL7bCnolkU

You might want to check that out.  Also I think the amount of effort you put 
into evening out the token distribution increases as vnode count shrinks.  The 
caveats are explored at:

https://thelastpickle.com/blog/2019/02/21/set-up-a-cluster-with-even-token-distribution.html


From: Voytek Jarnot 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, January 24, 2020 at 10:39 AM
To: "user@cassandra.apache.org" 
Subject: sstableloader & num_tokens change

Message from External Sender
Running 3.11.x, 4 nodes RF=3, default 256 tokens; moving to a different 4 node 
RF=3 cluster.

I've read that 256 is not an optimal default num_tokens value, and that 32 is 
likely a better option.

We have the "opportunity" to switch, as we're migrating environments and will 
likely be using sstableloader to do so. I'm curious if there are any gotchas 
with using sstableloader to restore snapshots taken from 256-token nodes into a 
cluster with 32-token nodes (otherwise same # of nodes and same RF).

Thanks in advance.


Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Reid Pinchback
I have plans to do so in the near-ish future.  People keep adding things to my 
to-do list, and I don’t have something on my to-do list yet saying “stop people 
from adding things to my to-do list”.  

Assuming I get to that point, if I answer something and I think something I 
wrote is relevant, I’ll point to it for those who want more details.  Email 
discussion threads, sometimes it is more helpful to say things a bit more 
abbreviated.  Not everybody needs details, many people have more context than I 
do and can fill in the backstory on their own.

R

From: Sergio 
Date: Wednesday, January 22, 2020 at 4:46 PM
To: Reid Pinchback 
Cc: "user@cassandra.apache.org" 
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Message from External Sender
Thanks for the explanation. It should deserve a blog post

Sergio

On Wed, Jan 22, 2020, 1:22 PM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
The reaper logs will say if nodes are being skipped.  The web UI isn’t that 
good at making it apparent.  You can sometimes tell it is likely happening when 
you see time gaps between parts of the repair.  This is for when nodes are 
skipped because of a timeout, but not only that.  The gaps are mostly 
controlled by the combined results of segmentCountPerNode, repairIntensity, and 
hangingRepairTimeoutMins.  The last of those three is the most obvious 
influence on timeouts, but the other two have some impact on the work attempted 
and the size of the time gaps.  However the C* version also has some bearing, 
as it influences how hard it is to process the data needed for repairs.

The more subtle aspect of node skipping isn’t the hanging repairs.  When repair 
of a token range is first attempted, Reaper uses JMX to ask C* if a repair is 
already underway.  The way it asks is very simplistic, so it doesn’t mean a 
repair is underway for that particular token range.  It just means something 
looking like a repair is going on.  Basically it just asks “hey is there a 
thread with the right magic naming pattern?”  The problem I think is that when 
you get some repair activity triggered on reads and writes for inconsistent 
data, I believe they show up as these kinds of threads too.  If you have a bad 
usage pattern of C* (where you write then very soon read back) then logically 
you’d expect this to happen quite a lot.

I’m not an expert on the internals since I’m not one of the C* contributors, 
but having stared at that part of the source quite a bit this year, that’s my 
take on what can happen.  And if I’m correct, that’s not a thing you can tune 
for. It is a consequence of C*-unfriendly usage patterns.

Bottom line though is that tuning repairs is only something you do if you find 
that repairs are taking longer than makes sense to you.  It’s totally separate 
from the notion that you should be able to run reaper-controlled repairs at 
least 2x per gc grace seconds.  That’s just a case of making some observations 
on the arithmetic of time intervals.


From: Sergio mailto:lapostadiser...@gmail.com>>
Date: Wednesday, January 22, 2020 at 4:08 PM
To: Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>>
Cc: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Message from External Sender
Thank you very much for your extended response.
Should I look in the log some particular message to detect such behavior?
How do you tune it ?

Thanks,

Sergio

On Wed, Jan 22, 2020, 12:59 PM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Kinda. It isn’t that you have to repair twice per se, just that the possibility 
of running repairs at least twice before GC grace seconds elapse means that 
clearly there is no chance of a tombstone not being subject to repair at least 
once before you hit your GC grace seconds.

Imagine a tombstone being created on the very first node that Reaper looked at 
in a repair cycle, but one second after Reaper completed repair of that 
particular token range.  Repairs will be complete, but that particular 
tombstone just missed being part of the effort.

Now your next repair run happens.  What if Reaper doesn’t look at that same 
node first?  It is easy to have happen, as there is a bunch of logic related to 
detection of existing repairs or things taking too long.  So the box that was 
“the first node” in that first repair run, through bad luck gets kicked down to 
later in the second run.  I’ve seen nodes get skipped multiple times (you can 
tune to reduce that, but bottom line… it happens).  So, bad luck you’ve got.  
Eventually the node does get repaired, and the aging tombstone finally gets 
removed.  All fine and dandy…

Provided that the second repair run got to that point BEFORE you hit your GC 
grace seconds.

That’s why you need enough time 

Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Reid Pinchback
The reaper logs will say if nodes are being skipped.  The web UI isn’t that 
good at making it apparent.  You can sometimes tell it is likely happening when 
you see time gaps between parts of the repair.  This is for when nodes are 
skipped because of a timeout, but not only that.  The gaps are mostly 
controlled by the combined results of segmentCountPerNode, repairIntensity, and 
hangingRepairTimeoutMins.  The last of those three is the most obvious 
influence on timeouts, but the other two have some impact on the work attempted 
and the size of the time gaps.  However the C* version also has some bearing, 
as it influences how hard it is to process the data needed for repairs.

The more subtle aspect of node skipping isn’t the hanging repairs.  When repair 
of a token range is first attempted, Reaper uses JMX to ask C* if a repair is 
already underway.  The way it asks is very simplistic, so it doesn’t mean a 
repair is underway for that particular token range.  It just means something 
looking like a repair is going on.  Basically it just asks “hey is there a 
thread with the right magic naming pattern?”  The problem I think is that when 
you get some repair activity triggered on reads and writes for inconsistent 
data, I believe they show up as these kinds of threads too.  If you have a bad 
usage pattern of C* (where you write then very soon read back) then logically 
you’d expect this to happen quite a lot.

I’m not an expert on the internals since I’m not one of the C* contributors, 
but having stared at that part of the source quite a bit this year, that’s my 
take on what can happen.  And if I’m correct, that’s not a thing you can tune 
for. It is a consequence of C*-unfriendly usage patterns.

Bottom line though is that tuning repairs is only something you do if you find 
that repairs are taking longer than makes sense to you.  It’s totally separate 
from the notion that you should be able to run reaper-controlled repairs at 
least 2x per gc grace seconds.  That’s just a case of making some observations 
on the arithmetic of time intervals.
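Put as arithmetic with the numbers from this thread, as a sketch rather than a prescription: gc_grace_seconds of 8 days is 8 x 86400 = 691200 seconds, so a Reaper schedule that completes a full run at least every ~4 days leaves room for two complete runs inside the grace window even with some skipped-node bad luck. The table setting itself, with hypothetical keyspace and table names, is just:

-- 8 days expressed in seconds; pair this with a repair cadence that can
-- complete at least twice within the window (i.e. every ~4 days or less)
ALTER TABLE my_keyspace.my_table
WITH gc_grace_seconds = 691200;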


From: Sergio 
Date: Wednesday, January 22, 2020 at 4:08 PM
To: Reid Pinchback 
Cc: "user@cassandra.apache.org" 
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Message from External Sender
Thank you very much for your extended response.
Should I look in the log some particular message to detect such behavior?
How do you tune it ?

Thanks,

Sergio

On Wed, Jan 22, 2020, 12:59 PM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Kinda. It isn’t that you have to repair twice per se, just that the possibility 
of running repairs at least twice before GC grace seconds elapse means that 
clearly there is no chance of a tombstone not being subject to repair at least 
once before you hit your GC grace seconds.

Imagine a tombstone being created on the very first node that Reaper looked at 
in a repair cycle, but one second after Reaper completed repair of that 
particular token range.  Repairs will be complete, but that particular 
tombstone just missed being part of the effort.

Now your next repair run happens.  What if Reaper doesn’t look at that same 
node first?  It is easy to have happen, as there is a bunch of logic related to 
detection of existing repairs or things taking too long.  So the box that was 
“the first node” in that first repair run, through bad luck gets kicked down to 
later in the second run.  I’ve seen nodes get skipped multiple times (you can 
tune to reduce that, but bottom line… it happens).  So, bad luck you’ve got.  
Eventually the node does get repaired, and the aging tombstone finally gets 
removed.  All fine and dandy…

Provided that the second repair run got to that point BEFORE you hit your GC 
grace seconds.

That’s why you need enough time to run it twice.  Because you need enough time 
to catch the oldest possible tombstone, even if it is dealt with at the very 
end of a repair run.  Yes, it sounds like a bit of a degenerate case, but if 
you are writing a lot of data, the probability of not having the degenerate 
cases become real cases becomes vanishingly small.

R


From: Sergio mailto:lapostadiser...@gmail.com>>
Date: Wednesday, January 22, 2020 at 1:41 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>, Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Message from External Sender
I was wondering if I should always complete 2 repair cycles with reaper even
if one repair cycle finishes in 7 hours.

Currently, I have around 200GB in column family data size to be repaired and I 
was scheduling once repair a week and I was not having too much stress on my 8 
nodes cluster with i3xlarge nodes.

Thanks,

Sergio

On Wed, Jan 22, 2020 at 08:28 Sergio 
mailto:lapostadi

Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Reid Pinchback
Kinda. It isn’t that you have to repair twice per se, just that the possibility 
of running repairs at least twice before GC grace seconds elapse means that 
clearly there is no chance of a tombstone not being subject to repair at least 
once before you hit your GC grace seconds.

Imagine a tombstone being created on the very first node that Reaper looked at 
in a repair cycle, but one second after Reaper completed repair of that 
particular token range.  Repairs will be complete, but that particular 
tombstone just missed being part of the effort.

Now your next repair run happens.  What if Reaper doesn’t look at that same 
node first?  It is easy to have happen, as there is a bunch of logic related to 
detection of existing repairs or things taking too long.  So the box that was 
“the first node” in that first repair run, through bad luck gets kicked down to 
later in the second run.  I’ve seen nodes get skipped multiple times (you can 
tune to reduce that, but bottom line… it happens).  So, bad luck you’ve got.  
Eventually the node does get repaired, and the aging tombstone finally gets 
removed.  All fine and dandy…

Provided that the second repair run got to that point BEFORE you hit your GC 
grace seconds.

That’s why you need enough time to run it twice.  Because you need enough time 
to catch the oldest possible tombstone, even if it is dealt with at the very 
end of a repair run.  Yes, it sounds like a bit of a degenerate case, but if 
you are writing a lot of data, the probability of not having the degenerate 
cases become real cases becomes vanishingly small.

R


From: Sergio 
Date: Wednesday, January 22, 2020 at 1:41 PM
To: "user@cassandra.apache.org" , Reid Pinchback 

Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Message from External Sender
I was wondering if I should always complete 2 repairs cycles with reaper even 
if one repair cycle finishes in 7 hours.

Currently, I have around 200GB in column family data size to be repaired and I 
was scheduling once repair a week and I was not having too much stress on my 8 
nodes cluster with i3xlarge nodes.

Thanks,

Sergio

On Wed, Jan 22, 2020 at 08:28 Sergio 
mailto:lapostadiser...@gmail.com>> wrote:
Thank you very much! Yes I am using reaper!

Best,

Sergio

On Wed, Jan 22, 2020, 8:00 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Sergio, if you’re looking for a new frequency for your repairs because of the 
change, if you are using reaper, then I’d go for repair_freq <= gc_grace / 2.

Just serendipity with a conversation I was having at work this morning.  When 
you actually watch the reaper logs then you can see situations where unlucky 
timing with skipped nodes can make the time to remove a tombstone be up to 2 x 
repair_run_time.

If you aren’t using reaper, your mileage will vary, particularly if your 
repairs are consistent in the ordering across nodes.  Reaper can be moderately 
non-deterministic hence the need to be sure you can complete at least two 
repair runs.

R

From: Sergio mailto:lapostadiser...@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Date: Tuesday, January 21, 2020 at 7:13 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Message from External Sender
Thank you very much for your response.
The considerations mentioned are the ones that I was expecting.
I believe that I am good to go.
I just wanted to make sure that there was no need to run any other extra 
command beside that one.

Best,

Sergio

On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa 
mailto:jji...@gmail.com>> wrote:
Note that if you're actually running repairs within 5 days, and you adjust this 
to 8, you may stream a bunch of tombstones across in that 5-8 day window, which 
can increase disk usage / compaction (because as you pass 5 days, one replica 
may gc away the tombstones, the others may not because the tombstones shadow 
data, so you'll re-stream the tombstone to the other replicas)

On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims 
mailto:elli...@backblaze.com>> wrote:
In addition to extra space, queries can potentially be more expensive because 
more dead rows and tombstones will need to be scanned.  How much of a 
difference this makes will depend drastically on the schema and access pattern, 
but I wouldn't expect going from 5 days to 8 to be very noticeable.

On Tue, Jan 21, 2020 at 2:14 PM Sergio 
mailto:lapostadiser...@gmail.com>> wrote:
https://stackoverflow.com/a/22030790

Re: Is there any concern about increasing gc_grace_seconds from 5 days to 8 days?

2020-01-22 Thread Reid Pinchback
Sergio, if you’re looking for a new frequency for your repairs because of the 
change, if you are using reaper, then I’d go for repair_freq <= gc_grace / 2.

Just serendipity with a conversation I was having at work this morning.  When 
you actually watch the reaper logs then you can see situations where unlucky 
timing with skipped nodes can make the time to remove a tombstone be up to 2 x 
repair_run_time.

If you aren’t using reaper, your mileage will vary, particularly if your 
repairs are consistent in the ordering across nodes.  Reaper can be moderately 
non-deterministic hence the need to be sure you can complete at least two 
repair runs.

R

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, January 21, 2020 at 7:13 PM
To: "user@cassandra.apache.org" 
Subject: Re: Is there any concern about increasing gc_grace_seconds from 5 days 
to 8 days?

Message from External Sender
Thank you very much for your response.
The considerations mentioned are the ones that I was expecting.
I believe that I am good to go.
I just wanted to make sure that there was no need to run any other extra 
command beside that one.

Best,

Sergio

On Tue, Jan 21, 2020, 3:55 PM Jeff Jirsa 
mailto:jji...@gmail.com>> wrote:
Note that if you're actually running repairs within 5 days, and you adjust this 
to 8, you may stream a bunch of tombstones across in that 5-8 day window, which 
can increase disk usage / compaction (because as you pass 5 days, one replica 
may gc away the tombstones, the others may not because the tombstones shadow 
data, so you'll re-stream the tombstone to the other replicas)

On Tue, Jan 21, 2020 at 3:28 PM Elliott Sims 
mailto:elli...@backblaze.com>> wrote:
In addition to extra space, queries can potentially be more expensive because 
more dead rows and tombstones will need to be scanned.  How much of a 
difference this makes will depend drastically on the schema and access pattern, 
but I wouldn't expect going from 5 days to 8 to be very noticeable.

On Tue, Jan 21, 2020 at 2:14 PM Sergio 
mailto:lapostadiser...@gmail.com>> wrote:
https://stackoverflow.com/a/22030790


For CQLSH

alter table <table_name> with GC_GRACE_SECONDS = <seconds>;
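If it is easier to script than to type into cqlsh, the same statement can be 
issued through the Python driver; a minimal sketch, assuming a contact point of 
127.0.0.1 and made-up keyspace/table names:

from cassandra.cluster import Cluster

# Bump gc_grace_seconds to 8 days (691200 s) on a hypothetical table.
cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")   # assumed keyspace name
session.execute("ALTER TABLE my_table WITH gc_grace_seconds = 691200")
cluster.shutdown()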



On Tue, Jan 21, 2020 at 13:12 Sergio 
mailto:lapostadiser...@gmail.com>> wrote:
Hi guys!

I just wanted to confirm with you before doing such an operation. I expect to 
increase the space but nothing more than this. I  need to perform just :

UPDATE COLUMN FAMILY cf with GC_GRACE = 691,200; //8 days
Is it correct?

Thanks,

Sergio


Re: inter dc bandwidth calculation

2020-01-15 Thread Reid Pinchback
Oh, duh.  Revise that.  I was forgetting that multi-dc writes are sent to a 
single node in the other dc and tagged to be forwarded to other nodes within 
the dc.

So your quick-and-dirty estimate would be more like (write volume) x 2 to leave 
headroom for random other mechanics.
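As a back-of-the-envelope calculator for that heuristic, a small Python sketch 
(the write volume and the 2x headroom factor are placeholder assumptions you 
would replace with your own measurements):

def interdc_bandwidth_estimate(write_mb_per_sec, headroom_factor=2.0):
    """Rough inter-DC bandwidth guess: one copy of each write crosses the DC
    link (the remote coordinator forwards it within its own DC), plus headroom
    for repair traffic, TCP mechanics, and so on."""
    return write_mb_per_sec * headroom_factor

# Example: 40 MB/s of writes arriving in the source DC -> budget ~80 MB/s inter-DC.
print(interdc_bandwidth_estimate(40.0), "MB/s")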

R


On 1/15/20, 11:07 AM, "Reid Pinchback"  wrote:

 Message from External Sender

I would think that it would be largely driven by the replication factor.  
It isn't that the sstables are forklifted from one dc to another, it's just 
that the writes being made to the memtables are also shipped around by the 
coordinator nodes as the writes happen.  Operations at the sstable level, like 
compactions, are local to the node.

One potential wrinkle that I'm unclear on, is related to repairs.  I don't 
know if merkle trees are biased to mostly bounce around only intra-dc, versus 
how often they are communicated inter-dc.  Note that even queries can trigger 
some degree of repair traffic if you have a usage pattern of trying to read 
data recently written, because at the bleeding edge of the recent changes 
you'll have more cases of rows not having had time to settle to a consistent 
state.

If you want a quick-and-dirty heuristic, I'd probably take (write volume) x 
(replication factor) x 2 as a guestimate so you have some headroom for C* and 
TCP mechanics, but then monitor to see what your real use is.

R


On 1/15/20, 4:14 AM, "Osman Yozgatlıoğlu"  
wrote:

 Message from External Sender

Hello,

Is there any way to calculate inter dc bandwidth requirements for
proper operation?
I can't find any info about this subject.
Can we say how much of the sstable data collected at one DC has to be 
transferred to the other?
I could then calculate the bandwidth from the generated sstables.
I have TWCS with a one-hour window.

Regards,
Osman

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org







Re: inter dc bandwidth calculation

2020-01-15 Thread Reid Pinchback
I would think that it would be largely driven by the replication factor.  It 
isn't that the sstables are forklifted from one dc to another, it's just that 
the writes being made to the memtables are also shipped around by the 
coordinator nodes as the writes happen.  Operations at the sstable level, like 
compactions, are local to the node.

One potential wrinkle that I'm unclear on, is related to repairs.  I don't know 
if merkle trees are biased to mostly bounce around only intra-dc, versus how 
often they are communicated inter-dc.  Note that even queries can trigger some 
degree of repair traffic if you have a usage pattern of trying to read data 
recently written, because at the bleeding edge of the recent changes you'll 
have more cases of rows not having had time to settle to a consistent state.

If you want a quick-and-dirty heuristic, I'd probably take (write volume) x 
(replication factor) x 2 as a guestimate so you have some headroom for C* and 
TCP mechanics, but then monitor to see what your real use is.

R


On 1/15/20, 4:14 AM, "Osman Yozgatlıoğlu"  wrote:

 Message from External Sender

Hello,

Is there any way to calculate inter dc bandwidth requirements for
proper operation?
I can't find any info about this subject.
Can we say how much of the sstable data collected at one DC has to be 
transferred to the other?
I could then calculate the bandwidth from the generated sstables.
I have TWCS with a one-hour window.

Regards,
Osman

-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org





Re: cassandra_migration_wait

2020-01-13 Thread Reid Pinchback
I can’t find it anywhere either, but I’m looking at a 3.11.4 source image.  
From the naming I’d bet that this is being used to feed the 
cassandra.migration_task_wait_in_seconds property.  It’s already coded to have 
a default of 1 second, which matches what you are seeing in the shell script 
var.  The relevant Java source is 
org.apache.cassandra.service.MigrationManager, line 62.

From: Ben Mills 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, January 13, 2020 at 1:59 PM
To: "user@cassandra.apache.org" 
Subject: cassandra_migration_wait

Message from External Sender
Greetings,

We are running Cassandra 3.11.2 in Kubernetes and use a run.sh to set some 
environment variables and a few other things.

This script includes:

CASSANDRA_MIGRATION_WAIT="${CASSANDRA_MIGRATION_WAIT:-1}"

setting this environment variable to "1". I looked for documentation on this 
but cannot seem to find it anywhere. Anyone know what this is configuring and 
what the value implies?

Thanks in advance for your help.

Ben



Re: How bottom of cassandra save data efficiently?

2020-01-02 Thread Reid Pinchback
As others pointed out, compression will reduce the size and replication will 
(across nodes) increase the total size.

The other thing to note is that you can have multiple versions of the data in 
different sstables, and tombstones related to deletions and TTLs, and indexes, 
and any snapshots, and room for the temporary artifacts of compactions.   If 
you are just trying to have a quick guestimate of your space needs, I’d 
probably use your uncompressed calculation as a heuristic for the per-node 
storage required.
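If you want to turn that heuristic into a number, a trivial Python sketch (the 
replication factor and the overhead fudge factor are assumptions you would 
adjust for your own cluster):

def cluster_footprint_estimate(rows, bytes_per_row, replication_factor=3,
                               overhead_factor=1.5):
    """Uncompressed payload x replication factor, with a fudge factor for
    indexes, tombstones, older sstable versions and compaction scratch space."""
    return rows * bytes_per_row * replication_factor * overhead_factor

# The example from this thread: 100,000 rows of 56 bytes each.
payload = 100_000 * 56                                   # 5,600,000 bytes (~5.6 MB)
print(payload, cluster_footprint_estimate(100_000, 56))  # raw vs rough cluster-wide total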

From: lampahome 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 30, 2019 at 9:37 PM
To: "user@cassandra.apache.org" 
Subject: How bottom of cassandra save data efficiently?

Message from External Sender
If I use var a as primary key and var b as second key, and a and b are 16 bytes 
and 8 bytes.

And other data are 32 bytes.

In one row, I have a+b+data = 16+8+32 = 56 bytes.

If I have 100,000 rows to store in cassandra, will it occupy 56 x 100,000 
bytes on my disk? Or will the data be compressed?

thx


Re: Is cassandra schemaless?

2019-12-16 Thread Reid Pinchback
Once upon a time the implication of ‘nosql’ was ‘not SQL’, but these days it 
would be more accurate to characterize it as ‘not only SQL’.

‘schemaless’ also can be interpreted a little flexibly. In a relational 
database structure, you can think of ‘schema’ (with respect to tables) as 
meaning two things: how a table is structured, and that each row in that 
table obeys that structure.  In C* and DynamoDB the first notion remains true, 
because we declare the structure.  However the storage representation is more 
fluid so individual rows depend on which attributes have actually been stored; 
what the rows contain will obey some part of the schema, but they may not 
contain everything that was declared in the schema.

This is a more structured notion than you get with things that are document 
stores, e.g. elasticsearch. Those are more schemaless, but sometimes that can 
feel like a shell game.  You declare indexes based on your expectations, and if 
the data doesn’t meet your expectations, you won’t find it.  So people work to 
make their data meet the expectations.  If the data is designed to meet the 
expectations and the machinery using it is configured with the expectations, 
talking about things being schemaless nudges towards being a word game to hide 
something which is trying to behave isomorphically.  You can find that you wind 
up in a similar place in the end, just with tools that are less computationally 
efficient.
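Concretely, for the example further down this thread, the insert only succeeds 
once the column has been declared; a minimal sketch with the Python driver, 
assuming a local contact point and the keyspace named in the cqlsh prompt:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("key")   # keyspace name taken from the cqlsh prompt below

# The structure has to be declared first ...
session.execute("ALTER TABLE yo ADD test text")
# ... after which the insert that previously failed with
# "Undefined column name test" goes through.
session.execute("INSERT INTO yo (blk, count, test) VALUES (2, 4, '123')")
cluster.shutdown()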

From: Russell Spitzer 
Reply-To: "user@cassandra.apache.org" 
Date: Sunday, December 15, 2019 at 9:53 PM
To: user 
Subject: Re: Is cassandra schemaless?

Message from External Sender
Cassandra is not schemaless. Not all nosql databases are schemaless either, the 
term is a little outdated since many nosql databases now support some or all of 
ansi SQL. Cassandra does not though, just a very limited subset called CQL

On Sun, Dec 15, 2019, 8:04 PM lampahome 
mailto:pahome.c...@mirlab.org>> wrote:
I read about some differences between NoSQL and SQL, and one obvious difference 
is that NoSQL supports being schemaless.

But when I try it in cassandra, the result is not like that.

Ex:
cqlsh:key> Create table if not exists yo (blk bigint primary key, count int);
cqlsh:key> insert into yo (blk, count, test) values (2,4,'123');

It shows message="Undefined column name test"

So cassandra isn't schemaless?


Re: Measuring Cassandra Metrics at a Sessions/Connection Levels

2019-12-12 Thread Reid Pinchback
Metrics are exposed via JMX.  You can use something like jmxtrans or collectd 
with the jmx plugin to capture metrics per-node and route them to whatever you 
use to aggregate metrics.

From: Fred Habash 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, December 12, 2019 at 9:38 AM
To: "user@cassandra.apache.org" 
Subject: Measuring Cassandra Metrics at a Sessions/Connection Levels

Message from External Sender
Hi all ...

We are facing a scenario where we have to measure some metrics on a 
per-connection or per-client basis, for example the count of read/write 
requests by client IP/host/user/program. We want to know the source of C* 
requests for budgeting, capacity planning, or charge-backs.
We are running 2.2.8.

I did some research and I just wanted to verify my findings ...

1. C* 4+ has two instruments: 'nodetool clientstats' & the system_views.clients virtual table
2. Earlier releases have no native instruments to collect these metrics

Is there any other way to measure such metrics?


Thank you



Re: execute is faster than execute_async?

2019-12-11 Thread Reid Pinchback
Also note that you should be expecting async operations to be slower on a 
call-by-call basis.  Async protocols have added overhead.  The point of them 
really is to leave the client free to interleave other computing activity 
between the async calls.  It’s not usually a better way to do batch writing. 
That’s not an observation specific to C*, that’s just about understanding the 
role of async operations in computing.

There is some subtlety with distributed services like C* where you’re 
round-robining the calls around the cluster, where repeated async calls can win 
relative to sync because you aren’t waiting to hand off the next unit of work 
to a different node, but once the activity starts to queue up on any kind of 
resource, even just TCP buffering, you’ll likely be back to a situation where 
all you are measuring is the net difference in protocol overhead for async vs 
sync.

One of the challenges with performance testing is you have to be pretty clear 
on what exactly it is you are exercising, or all you can conclude from 
different numbers is that different numbers can exist.
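
If it helps make the point concrete, here is a rough sketch of timing the two 
approaches with the Python driver; the contact point, keyspace and table are 
made-up assumptions, and the absolute numbers will mean nothing outside your 
own environment:

import time
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")                           # assumed keyspace
insert = session.prepare("INSERT INTO t (id, val) VALUES (?, ?)")  # assumed table

def run_sync(n):
    start = time.perf_counter()
    for i in range(n):
        session.execute(insert, (i, "x"))      # wait for every single write
    return time.perf_counter() - start

def run_async_batched(n, window=100):
    start = time.perf_counter()
    futures = []
    for i in range(n):
        futures.append(session.execute_async(insert, (i, "x")))
        if len(futures) >= window:             # throttle: drain the window
            for f in futures:
                f.result()
            futures = []
    for f in futures:                          # drain any tail
        f.result()
    return time.perf_counter() - start

print("sync:", run_sync(1000), "async batched:", run_async_batched(1000))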

R

From: Alexander Dejanovski 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, December 11, 2019 at 7:44 AM
To: user 
Subject: Re: execute is faster than execute_async?

Message from External Sender
Hi,

you can check this piece of documentation from Datastax: 
https://docs.datastax.com/en/developer/python-driver/3.20/api/cassandra/cluster/#cassandra.cluster.Session.execute_async

The usual way of doing this is to send a bunch of execute_async() calls, adding 
the returned futures in a list. Once the list reaches the chosen threshold 
(usually we send around 100 queries and wait for them to finish before moving 
on the the next ones), loop through the futures and call the result() method to 
block until it completes.
Should look like this:


futures = []

for i in range(len(queries)):
    futures.append(session.execute_async(queries[i]))
    if len(futures) >= 100 or i == len(queries) - 1:
        for future in futures:
            results = future.result()  # will block until the query finishes
        futures = []  # empty the list



Haven't tested the code above but it should give you an idea on how this can be 
implemented.
Sending hundreds/thousands of queries without waiting for a result will DDoS 
the cluster, so you should always implement some throttling.

Cheers,

-
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com


On Wed, Dec 11, 2019 at 10:42 AM Jordan West 
mailto:jorda...@gmail.com>> wrote:
I’m not very familiar with the python client unfortunately. If it helps: In 
Java, async would return futures and at the end of submitting each batch you 
would block on them by calling get.

Jordan

On Wed, Dec 11, 2019 at 1:37 AM lampahome 
mailto:pahome.c...@mirlab.org>> wrote:


Jordan West mailto:jorda...@gmail.com>> wrote on Wednesday, December 11, 2019 
at 4:34 PM:
Hi,

Have you tried batching calls to execute_async with periodic blocking for the 
batch’s responses?

Can you give me some keywords about calling execute_async batch?

PS: I use python version.


Re: Dynamo autoscaling: does it beat cassandra?

2019-12-10 Thread Reid Pinchback
Hi Carl,

I can’t speak to all of the internal mechanics and what the committers factored 
in.  I have no doubt that intelligent decisions were the goal given the context 
of the time.  More where I come from is that at least in our case, we see nodes 
with a fair hunk of file data sitting in buffer cache, and the benefits of that 
have to become throttled when the buffered data won’t be consumable until 
decompression.  Unfortunately with Java you aren’t in the tidy situation where 
you can just claim a page of memory from buffer cache with no real overhead, so 
its plausible that memory copy vs memory traversal for decompaction don’t 
differ terribly.  Definitely something I want to look into.

Most of where I’ve seen I/O stalling relates to flushing of dirty pages from 
the writes of sstable compaction.  What to do depends a lot on the specifics of 
your situation.  TL;DR summary is that short bursts of write ops can choke out 
everything else once the I/O queue is filled.  It doesn’t really pertain to 
mean/median/95-percentile performance.  It starts to show at 99, and definitely 
999.

I don’t know if the interleaving with I/O wait results in some of the 
decompression being effectively free, it’s entirely plausible that this has 
been observed and the current approach improved accordingly. It’s a pretty 
reasonable CPU scheduling behavior unless cores are otherwise being kept busy, 
e.g. with memory copies or pauses to yank things from memory to CPU cache.  Jon 
Haddad recently pointed me to some resources that might explain getting less 
from CPU than the actual CPU numbers suggest, but I haven’t yet really wrapped 
my head around the details enough to decide how I would want to investigate 
reductions in CPU instructions executed.

I do know that we definitely saw from our latency metrics that read times are 
impacted when writes flush in a spike, so we tuned to mitigate it.  It probably 
doesn’t take much to achieve a read stall, as anything that stats a filesystem 
entry (either via cache miss on the dirnodes, or if you haven’t disabled atime) 
might be just as subject to stalling as anything that tries to read content 
from the file itself.

No opinion on 3.11.x handling of column metadata.  I’ve read that it is a great 
deal more complicated and a factor in various performance headaches, but like 
you, I haven’t gotten into the source around that so I don’t have a mental 
model for the details.

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, December 10, 2019 at 3:19 PM
To: "user@cassandra.apache.org" 
Subject: Re: Dynamo autoscaling: does it beat cassandra?

Message from External Sender
Dor and Reid: thanks, that was very helpful.

Is the large amount of compression an artifact of pre-cass3.11 where the column 
names were per-cell (combined with the cluster key for extreme verbosity, I 
think), so compression would at least be effective against those portions of 
the sstable data? IIRC the cass committers figured as long as you can shrink the 
data, the reduced size drops the time to read off of the disk, maybe even the 
time to get into CPU cache from memory and the CPU to decompress is somewhat 
"free" at that point since everything else is stalled for I/O or memory reads?

But I don't know how the 3.11.x format works to avoid spamming of those column 
names, I haven't torn into that part of the code.

On Tue, Dec 10, 2019 at 10:15 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Note that DynamoDB I/O throughput scaling doesn’t work well with brief spikes.  
Unless you write your own machinery to manage the provisioning, by the time AWS 
scales the I/O bandwidth your incident has long since passed.  It’s not a thing 
to rely on if you have a latency SLA.  It really only works for situations like 
a sustained alteration in load, e.g. if you have a sinusoidal daily traffic 
pattern, or periodic large batch operations that run for an hour or two, and 
you need the I/O adjustment while that takes place.

Also note that DynamoDB routinely chokes on write contention, which C* would 
rarely do.  About the only benefit DynamoDB has over C* is that more of its 
operations function as atomic mutations of an existing row.

One thing to also factor into the comparison is developer effort.  The DynamoDB 
API isn’t exactly tuned to making developers productive.  Most of the AWS APIs 
aren’t, really, once you use them for non-toy projects. AWS scales in many 
dimensions, but total developer effort is not one of them when you are talking 
about high-volume tier one production systems.

To respond to one of the other original points/questions, yes key and row 
caches don’t seem to be a win, but that would vary with your specific usage 
pattern.  Caches need a good enough hit rate to offset the GC impact.  Even 
when C* lets you move things off heap, you’ll see a fair number of GC-able 
artifacts associated with data in caches.  

Re: Seeing tons of DigestMismatchException exceptions after upgrading from 2.2.13 to 3.11.4

2019-12-10 Thread Reid Pinchback
Colleen, to your question, yes there is a difference between 2.x and 3.x that 
would impact repairs.  The merkel tree computations changed, to having a 
default tree depth that is greater. That can cause significant memory drag, to 
the point that nodes sometimes even OOM.  This has been fixed in 4.x to make 
the setting tunable.  I think 3.11.5 now contains the same as a back-patch.

From: Reid Pinchback 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, December 10, 2019 at 11:23 AM
To: "user@cassandra.apache.org" 
Subject: Re: Seeing tons of DigestMismatchException exceptions after upgrading 
from 2.2.13 to 3.11.4

Message from External Sender
Carl, your speculation matches our observations, and we have a use case with 
that unfortunate usage pattern.  Write-then-immediately-read is not friendly to 
eventually-consistent data stores. It makes the reading pay a tax that really 
is associated with writing activity.

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 9, 2019 at 3:18 PM
To: "user@cassandra.apache.org" 
Subject: Re: Seeing tons of DigestMismatchException exceptions after upgrading 
from 2.2.13 to 3.11.4

Message from External Sender
My speculation on rapidly churning/fast reads of recently written data:

- data written at quorum (for RF3): write confirm is after two nodes reply
- data read very soon after (possibly code antipattern), and let's assume the 
third node update hasn't completed yet (e.g. AWS network "variance"). The read 
will pick a replica, and then there is a 50% chance the second replica chosen 
for quorum read is the stale node, which triggers a DigestMismatch read repair.

Is that plausible?

The code seems to log the exception in all read repair instances, so it doesn't 
seem to be an ERROR with red blaring klaxons, maybe it should be a WARN?

On Mon, Nov 25, 2019 at 11:12 AM Colleen Velo 
mailto:cmv...@gmail.com>> wrote:
Hello,

As part of the final stages of our 2.2 --> 3.11 upgrades, one of our clusters 
(on AWS/ 18 nodes/ m4.2xlarge) produced some post-upgrade fits. We started 
getting spikes of Cassandra read and write timeouts despite the fact the 
overall metrics volumes were unchanged. As part of the upgrade process, there 
was a TWCS table that we used a facade implementation to help change the 
namespace of the compaction class, but that has very low query volume.

The DigestMismatchException error messages, (based on sampling the hash keys 
and finding which tables have partitions for that hash key), seem to be 
occurring on the heaviest volume table (4,000 reads, 1600 writes per second per 
node approximately), and that table has semi-medium row widths with about 10-40 
column keys. (Or at least the digest mismatch partitions have that type of 
width). The keyspace is an RF3 using NetworkTopology, the CL is QUORUM for both 
reads and writes.

We have experienced the DigestMismatchException errors on all 3 of the 
Production clusters that we have upgraded (all of them are single DC in the 
us-east-1/eu-west-1/ap-northeast-2 AWS regions) and in all three cases, those 
DigestMismatchException errors were not there in either the  2.1.x or 2.2.x 
versions of Cassandra.
Does anyone know of changes from 2.2 to 3.11 that would produce additional 
timeout problems, such as heavier blocking read repair logic?  Also,

We ran repairs (via reaper v1.4.8) (much nicer in 3.11 than 2.1) on all of the 
tables and across all of the nodes, and our timeouts seemed to have 
disappeared, but we continue to see a rapid streaming of the Digest mismatches 
exceptions, so much so that our Cassandra debug logs are rolling over every 15 
minutes..   There is a mail list post from 2018 that indicates that some 
DigestMismatchException error messages are natural if you are reading while 
writing, but the sheer volume that we are getting is very concerning:
 - 
https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html

Is that level of DigestMismatchException unusual? Or can that volume of 
mismatches appear if semi-wide rows simply require a lot of resolution because 
flurries of quorum reads/writes (RF3) on recent partitions have a decent chance 
of not having fully synced data on the replica reads? Does the digest mismatch 
error get debug-logged on every chance read repair? (edited)
Also, why are these DigestMismatchException only occurring once the upgrade to 
3.11 has occurred?

~

Sample DigestMismatchException error message:
DEBUG [ReadRepairStage:13]

Re: Seeing tons of DigestMismatchException exceptions after upgrading from 2.2.13 to 3.11.4

2019-12-10 Thread Reid Pinchback
Carl, your speculation matches our observations, and we have a use case with 
that unfortunate usage pattern.  Write-then-immediately-read is not friendly to 
eventually-consistent data stores. It makes the reading pay a tax that really 
is associated with writing activity.

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 9, 2019 at 3:18 PM
To: "user@cassandra.apache.org" 
Subject: Re: Seeing tons of DigestMismatchException exceptions after upgrading 
from 2.2.13 to 3.11.4

Message from External Sender
My speculation on rapidly churning/fast reads of recently written data:

- data written at quorum (for RF3): write confirm is after two nodes reply
- data read very soon after (possibly code antipattern), and let's assume the 
third node update hasn't completed yet (e.g. AWS network "variance"). The read 
will pick a replica, and then there is a 50% chance the second replica chosen 
for quorum read is the stale node, which triggers a DigestMismatch read repair.

Is that plausible?

The code seems to log the exception in all read repair instances, so it doesn't 
seem to be an ERROR with red blaring klaxons, maybe it should be a WARN?

On Mon, Nov 25, 2019 at 11:12 AM Colleen Velo 
mailto:cmv...@gmail.com>> wrote:
Hello,

As part of the final stages of our 2.2 --> 3.11 upgrades, one of our clusters 
(on AWS/ 18 nodes/ m4.2xlarge) produced some post-upgrade fits. We started 
getting spikes of Cassandra read and write timeouts despite the fact the 
overall metrics volumes were unchanged. As part of the upgrade process, there 
was a TWCS table that we used a facade implementation to help change the 
namespace of the compaction class, but that has very low query volume.

The DigestMismatchException error messages, (based on sampling the hash keys 
and finding which tables have partitions for that hash key), seem to be 
occurring on the heaviest volume table (4,000 reads, 1600 writes per second per 
node approximately), and that table has semi-medium row widths with about 10-40 
column keys. (Or at least the digest mismatch partitions have that type of 
width). The keyspace is an RF3 using NetworkTopology, the CL is QUORUM for both 
reads and writes.

We have experienced the DigestMismatchException errors on all 3 of the 
Production clusters that we have upgraded (all of them are single DC in the 
us-east-1/eu-west-1/ap-northeast-2 AWS regions) and in all three cases, those 
DigestMismatchException errors were not there in either the  2.1.x or 2.2.x 
versions of Cassandra.
Does anyone know of changes from 2.2 to 3.11 that would produce additional 
timeout problems, such as heavier blocking read repair logic?  Also,

We ran repairs (via reaper v1.4.8) (much nicer in 3.11 than 2.1) on all of the 
tables and across all of the nodes, and our timeouts seemed to have 
disappeared, but we continue to see a rapid streaming of the Digest mismatches 
exceptions, so much so that our Cassandra debug logs are rolling over every 15 
minutes..   There is a mail list post from 2018 that indicates that some 
DigestMismatchException error messages are natural if you are reading while 
writing, but the sheer volume that we are getting is very concerning:
 - 
https://www.mail-archive.com/user@cassandra.apache.org/msg56078.html

Is that level of DigestMismatchException unusual? Or can that volume of 
mismatches appear if semi-wide rows simply require a lot of resolution because 
flurries of quorum reads/writes (RF3) on recent partitions have a decent chance 
of not having fully synced data on the replica reads? Does the digest mismatch 
error get debug-logged on every chance read repair? (edited)
Also, why are these DigestMismatchException only occurring once the upgrade to 
3.11 has occurred?

~

Sample DigestMismatchException error message:
DEBUG [ReadRepairStage:13] 2019-11-22 01:38:14,448 ReadCallback.java:242 - 
Digest mismatch:
org.apache.cassandra.service.DigestMismatchException: Mismatch for key 
DecoratedKey(-6492169518344121155, 
66306139353831322d323064382d313037322d663965632d636565663165326563303965) 
(be2c0feaa60d99c388f9d273fdc360f7 vs 09eaded2d69cf2dd49718076edf56b36)
at 
org.apache.cassandra.service.DigestResolver.compareResponses(DigestResolver.java:92)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
org.apache.cassandra.service.ReadCallback$AsyncRepairRunner.run(ReadCallback.java:233)
 ~[apache-cassandra-3.11.4.jar:3.11.4]
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) 
[na:1.8.0_77]
at 

Re: Dynamo autoscaling: does it beat cassandra?

2019-12-10 Thread Reid Pinchback
Note that DynamoDB I/O throughput scaling doesn’t work well with brief spikes.  
Unless you write your own machinery to manage the provisioning, by the time AWS 
scales the I/O bandwidth your incident has long since passed.  It’s not a thing 
to rely on if you have a latency SLA.  It really only works for situations like 
a sustained alteration in load, e.g. if you have a sinusoidal daily traffic 
pattern, or periodic large batch operations that run for an hour or two, and 
you need the I/O adjustment while that takes place.

Also note that DynamoDB routinely chokes on write contention, which C* would 
rarely do.  About the only benefit DynamoDB has over C* is that more of its 
operations function as atomic mutations of an existing row.

One thing to also factor into the comparison is developer effort.  The DynamoDB 
API isn’t exactly tuned to making developers productive.  Most of the AWS APIs 
aren’t, really, once you use them for non-toy projects. AWS scales in many 
dimensions, but total developer effort is not one of them when you are talking 
about high-volume tier one production systems.

To respond to one of the other original points/questions, yes key and row 
caches don’t seem to be a win, but that would vary with your specific usage 
pattern.  Caches need a good enough hit rate to offset the GC impact.  Even 
when C* lets you move things off heap, you’ll see a fair number of GC-able 
artifacts associated with data in caches.  Chunk cache somewhat wins with being 
off-heap, because it isn’t just I/O avoidance with that cache, you’re also 
benefitting from the decompression.  However I’ve started to wonder how often 
sstable compression is worth the performance drag and internal C* complexity.  
If you compare to where a more traditional RDBMS would use compression, e.g. 
Postgres, use of compression is more selective; you only bear the cost in the 
places already determined to win from the tradeoff.

From: Dor Laor 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 9, 2019 at 5:58 PM
To: "user@cassandra.apache.org" 
Subject: Re: Dynamo autoscaling: does it beat cassandra?

Message from External Sender
The DynamoDB model has several key benefits over Cassandra's.
The most notable one is the tablet concept - data is partitioned into 10GB
chunks. So scaling happens where such a tablet reaches maximum capacity
and it is automatically divided to two. It can happen in parallel across the 
entire
data set, thus there is no concept of growing the amount of nodes or vnodes.
As the actual hardware is multi-tenant, the average server should have plenty
of capacity to receive these streams.

That said, when we benchmarked DynamoDB and just hit it with ingest workload,
even when it was reserved, we had to slow down the pace since we received many
'error 500' which means internal server errors. Their hot partitions do not 
behave great
as well.

So I believe a growth of 10% the capacity with good key distribution can be 
handled well
but a growth of 2x in a short time will fail. It's something you're expect from 
any database
but Dynamo has an advantage with tablets and multitenancy and issues with hot 
partitions
and accounting of hot keys which will get cached in Cassandra better.

Dynamo allows you to detach compute from the storage which is a key benefit in 
a serverless, spiky deployment.

On Mon, Dec 9, 2019 at 1:02 PM Jeff Jirsa 
mailto:jji...@gmail.com>> wrote:
Expansion probably much faster in 4.0 with complete sstable streaming (skips 
ser/deser), though that may have diminishing returns with vnodes unless you're 
using LCS.

Dynamo on demand / autoscaling isn't magic - they're overprovisioning to give 
you the burst, then expanding on demand. That overprovisioning comes with a 
cost. Unless you're actively and regularly scaling, you're probably going to 
pay more for it.

It'd be cool if someone focused on this - I think the faster streaming goes a 
long way. The way vnodes work today make it difficult to add more than one at a 
time without violating consistency, and thats unlikely to change, but if each 
individual node is much faster, that may mask it a bit.



On Mon, Dec 9, 2019 at 12:35 PM Carl Mueller 
 wrote:
Dynamo salespeople have been pushing autoscaling abilities that have been one 
of the key temptations to our management to switch off of cassandra.

Has anyone done any numbers on how well dynamo will autoscale demand spikes, 
and how we could architect cassandra to compete with such abilities?

We probably could overprovision and with the presumably higher cost of dynamo 
beat it, although the sales engineers claim they are closing the cost factor 
too. We could vertically scale to some degree, but node expansion seems close.

VNode expansion is still limited to one at a time?

We use VNodes so we can't do netflix's cluster doubling, correct? With cass 
4.0's alleged segregation of the data by token we could though and possibly 
also "prep" the node by having the necessary 

Re: Predicting Read/Write Latency as a Function of Total Requests & Cluster Size

2019-12-10 Thread Reid Pinchback
Latency SLAs are very much *not* Cassandra’s sweet spot, scaling throughput and 
storage is more where C*’s strengths shine.  If you want just median latency 
you’ll find things a bit more amenable to modeling, but not if you have 2 nines 
and particularly not 3 nines SLA expectations.  Basically, the harder you push 
on the nodes, the more you get sporadic but non-ignorable timing artifacts due 
to garbage collection and IO stalls when the flushing of the writes can choke 
out the disk reads.  Also, running in AWS, you’ll find that noisy neighbors are 
a routine issue no matter what the specifics of your use.

What your actual data model is, and what your patterns of reads and writes are, 
the impact of deletes and TTLs requiring tombstone cleanup, etc., all 
dramatically change the picture.

If you aren’t already aware of it, there is something called cassandra-stress 
that can help you do some experiments. The challenge though is determining if 
the experiments are representative of what your actual usage will be.  Because 
of the GC issues in anything implemented in a JVM or interpreter, it’s pretty 
easy to fall off the cliff of relevance.  TLP wrote an article about some of 
the challenges of this with cassandra-stress:

https://thelastpickle.com/blog/2017/02/08/Modeling-real-life-workloads-with-cassandra-stress.html

Note that one way to not have to care a lot about variable latency is to make 
use of speculative retry.  Basically you’re trading off some of your median 
throughput to help achieve a latency SLA.  The tradeoff benefit breaks down 
when you get to 3 nines.
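
Speculative retry is a per-table setting; a minimal sketch of adjusting it 
through the Python driver, with a made-up table name and a percentile you would 
pick against your own latency profile:

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])
session = cluster.connect("my_keyspace")   # assumed keyspace
# Retry the read against another replica once the original replica is slower
# than the table's observed 99th percentile read latency.
session.execute("ALTER TABLE my_table WITH speculative_retry = '99PERCENTILE'")
cluster.shutdown()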

I’m actually hoping to start on some modeling of what the latency surface looks 
like with different assumptions in the new year, not because I expect the 
specific numbers to translate to anybody else but just to show how the 
underyling dynamics evidence themselves in metrics when C* nodes are under 
duress.

R


From: Fred Habash 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, December 10, 2019 at 9:57 AM
To: "user@cassandra.apache.org" 
Subject: Predicting Read/Write Latency as a Function of Total Requests & 
Cluster Size

Message from External Sender
I'm looking for an empirical way to answer these two question:

1. If I increase application work load (read/write requests) by some 
percentage, how is it going to affect read/write latency. Of course, all other 
factors remaining constant e.g. ec2 instance class, ssd specs, number of nodes, 
etc.

2) How many nodes do I have to add to maintain a given read/write latency?

Are there are any methods or instruments out there that can help answer these 
que




Thank you



Re: AWS ephemeral instances + backup

2019-12-06 Thread Reid Pinchback
Correction:  “most of your database will be in chunk cache, or buffer cache 
anyways.”

From: Reid Pinchback 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, December 6, 2019 at 10:16 AM
To: "user@cassandra.apache.org" 
Subject: Re: AWS ephemeral instances + backup

Message from External Sender
If you’re only going to have a small storage footprint per node like 100gb, 
another option comes to mind. Use an instance type with large ram.  Use an EBS 
storage volume on an EBS-optimized instance type, and take EBS snapshots. Most 
of your database will be in chunk cache anyways, so you only need to make sure 
that the dirty background writer is keeping up.  I’d take a look at iowait 
during a snapshot and see if the results are acceptable for a running node.  
Even if it is marginal, if you’re only snapshotting one node at a time, then 
speculative retry would just skip over the temporary slowpoke.
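
For the snapshot piece of that, a minimal boto3 sketch, with a placeholder 
region, volume ID and node name, and snapshotting one node's data volume at a 
time as described above:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed region

def snapshot_data_volume(volume_id, node_name):
    """Kick off an EBS snapshot of one node's data volume."""
    resp = ec2.create_snapshot(
        VolumeId=volume_id,
        Description="cassandra data volume backup for " + node_name,
    )
    return resp["SnapshotId"]

# Snapshot one node, watch iowait on it, then move on to the next.
print(snapshot_data_volume("vol-0123456789abcdef0", "cassandra-node-1"))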

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, December 5, 2019 at 3:21 PM
To: "user@cassandra.apache.org" 
Subject: AWS ephemeral instances + backup

Message from External Sender
Does anyone have experience tooling written to support this strategy:

Use case: run cassandra on i3 instances on ephemerals but synchronize the 
sstables and commitlog files to the cheapest EBS volume type (those have bad 
IOPS but decent enough throughput)

On node replace, the startup script for the node, back-copies the sstables and 
commitlog state from the EBS to the ephemeral.

As can be seen: 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

the (presumably) spinning rust tops out at 2375 MB/sec (using multiple EBS 
volumes presumably) that would incur about a ten minute delay for node 
replacement for a 1TB node, but I imagine this would only be used on higher 
IOPS r/w nodes with smaller densities, so 100GB would be about a minute of 
delay only, already within the timeframes of an AWS node replacement/instance 
restart.




Re: AWS ephemeral instances + backup

2019-12-06 Thread Reid Pinchback
If you’re only going to have a small storage footprint per node like 100gb, 
another option comes to mind. Use an instance type with large ram.  Use an EBS 
storage volume on an EBS-optimized instance type, and take EBS snapshots. Most 
of your database will be in chunk cache anyways, so you only need to make sure 
that the dirty background writer is keeping up.  I’d take a look at iowait 
during a snapshot and see if the results are acceptable for a running node.  
Even if it is marginal, if you’re only snapshotting one node at a time, then 
speculative retry would just skip over the temporary slowpoke.

From: Carl Mueller 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, December 5, 2019 at 3:21 PM
To: "user@cassandra.apache.org" 
Subject: AWS ephemeral instances + backup

Message from External Sender
Does anyone have experience tooling written to support this strategy:

Use case: run cassandra on i3 instances on ephemerals but synchronize the 
sstables and commitlog files to the cheapest EBS volume type (those have bad 
IOPS but decent enough throughput)

On node replace, the startup script for the node, back-copies the sstables and 
commitlog state from the EBS to the ephemeral.

As can be seen: 
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/EBSVolumeTypes.html

the (presumably) spinning rust tops out at 2375 MB/sec (using multiple EBS 
volumes presumably) that would incur about a ten minute delay for node 
replacement for a 1TB node, but I imagine this would only be used on higher 
IOPS r/w nodes with smaller densities, so 100GB would be about a minute of 
delay only, already within the timeframes of an AWS node replacement/instance 
restart.



Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB"

2019-12-04 Thread Reid Pinchback
Probably helps to think of how swap actually functions.  It has a valid place, 
so long as the behavior of the kernel and the OOM killer are understood.

You can have a lot of cold pages that have nothing at all to do with C*.  If 
you look at where memory goes, it isn’t surprising to see things that the 
kernel finds it can page out, leaving RAM for better things.  I’ve seen crond 
soak up a lot of memory, and Dell’s assorted memory-bloated tooling, for 
example. Anything that is truly cold, swap is your friend because those things 
are infrequently used… swapping them in and out leaves more memory on average 
for what you want.  However, that’s not huge numbers, that could be something 
like a half gig of RAM kept routinely free, depending on the assorted tooling 
you have as a baseline install for servers.

If swap exists to avoid the OOM killer on truly active processes, the returns 
there diminish rapidly. Within seconds you’ll find you can’t even ssh into a 
box to investigate. In something like a traditional database it’s worth the 
pain because there are multiple child processes to the rdbms, and the OOM 
killer preferentially targets big process families.  Databases can go into a 
panic if you toast a child, and you have a full-blown recovery on your hands.  
Fortunately the more mature databases give you knobs for memory tuning, like 
being able to pin particular tables in memory if they are critical; anything 
not pinned (via madvise I believe) can get tossed when under pressure.

The situation is a bit different with C*.  By design, you have replicas that 
the clients automatically find, and things like speculative retry cause 
processing to skip over the slowpokes. The better-slow-than-dead argument seems 
more tenuous to me here than for an rdbms.  And if you have an SLA based on 
latency, you’ll never meet it if you have page faults happening during memory 
references in the JVM. So if you have swappiness enabled, probably best to keep 
it tuned low.  That way a busy C* JVM hopefully is one of the last victims in 
the race to shove pages to swap.
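
As a trivial way to check where a box currently stands on that, a small 
Linux-only Python sketch (it only reads the kernel setting; changing it is an 
ops decision made via sysctl):

def read_proc(path):
    with open(path) as f:
        return f.read().strip()

# Lower vm.swappiness makes the kernel less eager to swap out anonymous pages
# such as the C* heap; 1 is a common choice when swap is left enabled.
swappiness = int(read_proc("/proc/sys/vm/swappiness"))
print("vm.swappiness =", swappiness)
if swappiness > 10:
    print("consider lowering it on latency-sensitive C* nodes")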



From: Shishir Kumar 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, December 4, 2019 at 8:04 AM
To: "user@cassandra.apache.org" 
Subject: Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk 
of 1.000MiB"

Message from External Sender
Correct. Normally one should avoid this, as performance might degrade, but the 
system will not die (until the process gets paged out).

In production we haven't done this (we just changed disk_access_mode to 
mmap_index_only). We have an environment used by customers to train/beta test 
that grows rapidly. Investing in infra does not make sense from a cost 
perspective, so swap is an option.

But here, if the environment is up and running, it will be interesting to 
understand what is consuming memory and whether the infra is sized correctly.

-Shishir
On Wed, 4 Dec 2019, 16:13 Hossein Ghiyasi Mehr, 
mailto:ghiyasim...@gmail.com>> wrote:
"3. Though Datastax do not recommended and recommends Horizontal scale, so 
based on your requirement alternate old fashion option is to add swap space."
Hi Shishir,
swap isn't recommended by DataStax!

---
VafaTech.com - A Total Solution for Data Gathering & Analysis
---


On Tue, Dec 3, 2019 at 5:53 PM Shishir Kumar 
mailto:shishirroy2...@gmail.com>> wrote:
Options, assuming the model and configurations are good and the data size per 
node is less than 1 TB (though there is no such benchmark):

1. Scale the infra for memory.
2. Try to change disk_access_mode to mmap_index_only. In this case you should 
not have any in-memory DB tables.
3. Though Datastax does not recommend it and recommends horizontal scale, based 
on your requirement an alternate old-fashioned option is to add swap space.

-Shishir

On Tue, 3 Dec 2019, 15:52 John Belliveau, 
mailto:belliveau.j...@gmail.com>> wrote:
Reid,

I've only been working with Cassandra for 2 years, and this echoes my 
experience as well.

Regarding the cache use, I know every use case is different, but have you 
experimented and found any performance benefit to increasing its size?

Thanks,
John Belliveau

On Mon, Dec 2, 2019, 11:07 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Rahul, if my memory of this is correct, that particular logging message is 
noisy, the cache is pretty much always used to its limit (and why not, it’s a 
cache, no point in using less than you have).

No matter what value you set, you’ll just change the “reached (….)” part of it. 
 I think what would help you more is to work with the team(s) that have apps 
depending upon C* and decide what your performance SLA is with them.  If you 
are meeting your SLA, you don’t care about noisy messages.  If you aren’t 
meeting your SLA, then the noisy messages become sources of ideas to look at.

One thing you’ll find out pretty quickly

Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB"

2019-12-03 Thread Reid Pinchback
John, anything I’ll say will be as a collective ‘we’ since it has been a team 
effort here at Trip, and I’ve just been the hired gun to help out a bit. I’m 
more of a Postgres and Java guy so filter my answers accordingly.

I can’t say we saw as much relevance to tuning chunk cache size, as we did to 
do everything possible to migrate things off-heap.  I haven’t worked with 2.x 
so I don’t know how much these options changed, but in 3.11.x anyways, you 
definitely can migrate a fair bit off-heap.  Our first use case was 3 9’s 
sensitive on latency, which turns out to be a rough go for C* particularly if 
the data model is a bit askew from C*’s sweet spot, as was true for us.  The 
deeper merkle trees that were introduced somewhere I think in the 3.0.x series, 
that was a bane of our existence, we back-patched the 4.0 work to tune the tree 
height so that we weren’t OOMing nodes during reaper repair runs.

As to Shishir’s notion of using swap, because latency mattered to us, we had 
RAM headroom on the boxes.  We couldn’t use it all without pushing on something 
that was hurting us on 3 9’s.  C* is like this over-constrained problem space 
when it comes to tuning, poking in one place resulted in a twitch somewhere 
else, and we had to see which twitches worked out in our favour. If, like us, 
you have RAM headroom, you’re unlikely to care about swap for obvious reasons.  
All you really need is enough room for the O/S file buffer cache.

Tuning related to I/O and file buffer cache mattered a fair bit.  As did GC 
tuning obviously.  Personally, if I were to look at swap as helpful, I’d be 
debating with myself if the sstables should just remain uncompressed in the 
first place.  After all, swap space is disk space so holding 
compressed+uncompressed at the same time would only make sense if the storage 
footprint was large but the hot data in use was routinely much smaller… yet 
stuck around long enough in a cold state that the kernel would target it to 
swap out.  That’s a lot of if’s to line up to your benefit.  When it comes to a 
system running based on garbage collection, I get skeptical of how effectively 
the O/S will determine what is good to swap. Most of the JVM memory in C* 
churns at a rate that you wouldn’t want swap i/o to combine with if you cared 
about latency.  Not everybody cares about tight variance on latency though, so 
there can be other rationales for tuning that would result in different 
conclusions from ours.

I might have more definitive statements to make in the upcoming months, I’m in 
the midst of putting together my own test cluster for more controlled analysis 
on C* and Kafka tuning. Tuning live environments I’ve found makes it hard to 
control the variables enough for my satisfaction. It can feel like a game of 
empirical whack-a-mole.


From: Shishir Kumar 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, December 3, 2019 at 9:23 AM
To: "user@cassandra.apache.org" 
Subject: Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk 
of 1.000MiB"

Message from External Sender
Options, assuming the model and configurations are good and the data size per 
node is less than 1 TB (though there is no such benchmark):

1. Scale the infra for memory.
2. Try to change disk_access_mode to mmap_index_only. In this case you should 
not have any in-memory DB tables.
3. Though Datastax does not recommend it and recommends horizontal scale, based 
on your requirement an alternate old-fashioned option is to add swap space.

-Shishir

On Tue, 3 Dec 2019, 15:52 John Belliveau, 
mailto:belliveau.j...@gmail.com>> wrote:
Reid,

I've only been working with Cassandra for 2 years, and this echoes my 
experience as well.

Regarding the cache use, I know every use case is different, but have you 
experimented and found any performance benefit to increasing its size?

Thanks,
John Belliveau

On Mon, Dec 2, 2019, 11:07 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Rahul, if my memory of this is correct, that particular logging message is 
noisy, the cache is pretty much always used to its limit (and why not, it’s a 
cache, no point in using less than you have).

No matter what value you set, you’ll just change the “reached (….)” part of it. 
 I think what would help you more is to work with the team(s) that have apps 
depending upon C* and decide what your performance SLA is with them.  If you 
are meeting your SLA, you don’t care about noisy messages.  If you aren’t 
meeting your SLA, then the noisy messages become sources of ideas to look at.

One thing you’ll find out pretty quickly.  There are a lot of knobs you can 
turn with C*, too many to allow for easy answers on what you should do.  Figure 
out what your throughput and latency SLAs are, and you’ll know when to stop 
tuning.  Otherwise you’ll discover that it’s a rabbit hole you can dive into 
and not come out of for weeks.


From: Hossein Ghiyasi Mehr mailto:ghiyasim...@gmail.com>>

Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk of 1.000MiB"

2019-12-02 Thread Reid Pinchback
Rahul, if my memory of this is correct, that particular logging message is 
noisy, the cache is pretty much always used to its limit (and why not, it’s a 
cache, no point in using less than you have).

No matter what value you set, you’ll just change the “reached (….)” part of it. 
 I think what would help you more is to work with the team(s) that have apps 
depending upon C* and decide what your performance SLA is with them.  If you 
are meeting your SLA, you don’t care about noisy messages.  If you aren’t 
meeting your SLA, then the noisy messages become sources of ideas to look at.

One thing you’ll find out pretty quickly.  There are a lot of knobs you can 
turn with C*, too many to allow for easy answers on what you should do.  Figure 
out what your throughput and latency SLAs are, and you’ll know when to stop 
tuning.  Otherwise you’ll discover that it’s a rabbit hole you can dive into 
and not come out of for weeks.


From: Hossein Ghiyasi Mehr 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, December 2, 2019 at 10:35 AM
To: "user@cassandra.apache.org" 
Subject: Re: "Maximum memory usage reached (512.000MiB), cannot allocate chunk 
of 1.000MiB"

Message from External Sender
It may be helpful: 
https://thelastpickle.com/blog/2018/08/08/compression_performance.html
It's complex. Simple explanation: Cassandra keeps sstables in memory based on chunk size and sstable parts. It manages loading new sstables into memory, based on requests against different sstables, correctly. It's the sstables loaded in memory that you should keep an eye on.

VafaTech.com - A Total Solution for Data Gathering & Analysis


On Mon, Dec 2, 2019 at 6:18 PM Rahul Reddy 
mailto:rahulreddy1...@gmail.com>> wrote:
Thanks Hossein,

How are the chunks moved out of memory (LRU?) when it wants to make room for new requests to get chunks? If it has a mechanism to clear chunks from the cache, what causes the "cannot allocate chunk" message? Can you point me to any documentation?

On Sun, Dec 1, 2019, 12:03 PM Hossein Ghiyasi Mehr 
mailto:ghiyasim...@gmail.com>> wrote:
Chunks are parts of sstables. When there is enough space in memory to cache them, read performance will increase if the application requests them again.

Your real answer is application dependent. For example, write-heavy applications are different from read-heavy or read-write-heavy ones. Real-time applications are different from time-series data environments, and so on.



On Sun, Dec 1, 2019 at 7:09 PM Rahul Reddy 
mailto:rahulreddy1...@gmail.com>> wrote:
Hello,

We are seeing "memory usage reached 512 MiB, cannot allocate 1 MiB".  I see this because file_cache_size_mb is set to 512MB by default.

The Datastax documentation recommends increasing file_cache_size.

We have 32G of memory overall, with 16G allocated to Cassandra. What is the recommended value in my case? Also, when does this memory get filled up, and does frequent nodetool flush help in avoiding these info messages?
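
For what it's worth, the knob in question lives in cassandra.yaml.  A sketch only; the 1024 below is an assumption to experiment with, not a recommendation for a 32G host with a 16G heap:

# cassandra.yaml
file_cache_size_mb: 1024   # chunk cache cap; the INFO message only means the cap was reached, not that a read failed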


Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Reid Pinchback
Just food for thought.

Elevated read requests won’t result in escalating pending compactions, except 
in the corner case where the reads trigger additional write work, like for a 
repair or lurking tombstones deemed droppable.  For a sustained growth in 
pending compactions, that’s not looking like random tripping over corner cases. 
  All an elevated read request rate would do, if it weren’t for an increasing 
number of sstables, is cause you to churn the chunk cache. Reads would be 
slower due to the cache misses but the memory footprint wouldn’t be that 
different.


From: "Steinmaurer, Thomas" 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, November 6, 2019 at 2:43 PM
To: "user@cassandra.apache.org" 
Subject: RE: Cassandra 3.0.18 went OOM several hours after joining a cluster

Message from External Sender
Reid,

thanks for thoughts.

I agree with your last comment and I’m pretty sure/convinced that the 
increasing number of SSTables is causing the issue, although I’m not sure if 
compaction or read requests (after the node flipped from UJ to UN) or both, but 
I tend more towards client read requests resulting in accessing a high number 
of SSTables which basically results in ~ 2Mbyte on-heap usage per 
BigTableReader instance, with ~ 5K such object instances on the heap.

The big question for us is why this starts to pop-up with Cas 3.0 without 
seeing this with 2.1 in > 3 years production usage.

To avoid double work, I will try to continue providing additional information / 
thoughts on the Cassandra ticket.

Regards,
Thomas

From: Reid Pinchback 
Sent: Mittwoch, 06. November 2019 18:28
To: user@cassandra.apache.org
Subject: Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

The other thing that comes to mind is that the increase in pending compactions 
suggests back pressure on compaction activity.  GC is only one possible source 
of that.  Between your throughput setting and how your disk I/O is set up, 
maybe that’s throttling you to a rate where the rate of added reasons for 
compactions > the rate of compactions completed.

In fact, the more that I think about it, I wonder about that a lot.

If you can’t keep up with compactions, then operations have to span more and 
more SSTables over time.  You’ll keep holding on to what you read, as you read 
more of them, until eventually…pop.
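
A quick way to check and adjust the throttle being described here, as a sketch (the 64 MB/s is an arbitrary example, not a recommendation):

nodetool getcompactionthroughput      # show the current throttle in MB/s
nodetool setcompactionthroughput 64   # raise it; 0 removes the throttle entirely
nodetool compactionstats              # watch whether pending compactions start to drain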


From: Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Date: Wednesday, November 6, 2019 at 12:11 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

Message from External Sender
My first thought was that you were running into the merkle tree depth problem, 
but the details on the ticket don’t seem to confirm that.

It does look like eden is too small.   C* lives in Java’s GC pain point, a lot 
of medium-lifetime objects.  If you haven’t already done so, you’ll want to 
configure as many things to be off-heap as you can, but I’d definitely look at 
improving the ratio of eden to old gen, and see if you can get the young gen GC 
activity to be more successful at sweeping away the medium-lived objects.
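
Purely to illustrate the eden-to-old-gen point, a CMS sketch with sizes that are assumptions rather than recommendations for this particular heap:

# jvm.options (or cassandra-env.sh on older layouts) - illustrative only
-Xms8G
-Xmx8G
-Xmn2G                       # explicit young gen; more eden gives medium-lived objects room to die young
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=4   # let objects age a little longer before promotion to old gen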

All that really comes to mind is if you’re getting to a point where GC isn’t 
coping.  That can be hard to sometimes spot on metrics with coarse granularity. 
 Per-second metrics might show CPU cores getting pegged.

I’m not sure that GC tuning eliminates this problem, but if it isn’t being 
caused by that, GC tuning may at least improve the visibility of the underlying 
problem.

From: "Steinmaurer, Thomas" 
mailto:thomas.steinmau...@dynatrace.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Date: Wednesday, November 6, 2019 at 11:27 AM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: Cassandra 3.0.18 went OOM several hours after joining a cluster

Message from External Sender
Hello,

after moving from 2.1.18 to 3.0.18, we are facing OOM situations after several 
hours a node has successfully joined a cluster (via auto-bootstrap).

I have created the following ticket trying to describe the situation, including 
hprof / MAT screens: 
https://issues.apache.org/jira/browse/CASSANDRA-15400

Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Reid Pinchback
The other thing that comes to mind is that the increase in pending compactions 
suggests back pressure on compaction activity.  GC is only one possible source 
of that.  Between your throughput setting and how your disk I/O is set up, 
maybe that’s throttling you to a rate where the rate of added reasons for 
compactions > the rate of compactions completed.

In fact, the more that I think about it, I wonder about that a lot.

If you can’t keep up with compactions, then operations have to span more and 
more SSTables over time.  You’ll keep holding on to what you read, as you read 
more of them, until eventually…pop.


From: Reid Pinchback 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, November 6, 2019 at 12:11 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

Message from External Sender
My first thought was that you were running into the merkle tree depth problem, 
but the details on the ticket don’t seem to confirm that.

It does look like eden is too small.   C* lives in Java’s GC pain point, a lot 
of medium-lifetime objects.  If you haven’t already done so, you’ll want to 
configure as many things to be off-heap as you can, but I’d definitely look at 
improving the ratio of eden to old gen, and see if you can get the young gen GC 
activity to be more successful at sweeping away the medium-lived objects.

All that really comes to mind is if you’re getting to a point where GC isn’t 
coping.  That can be hard to sometimes spot on metrics with coarse granularity. 
 Per-second metrics might show CPU cores getting pegged.

I’m not sure that GC tuning eliminates this problem, but if it isn’t being 
caused by that, GC tuning may at least improve the visibility of the underlying 
problem.

From: "Steinmaurer, Thomas" 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, November 6, 2019 at 11:27 AM
To: "user@cassandra.apache.org" 
Subject: Cassandra 3.0.18 went OOM several hours after joining a cluster

Message from External Sender
Hello,

after moving from 2.1.18 to 3.0.18, we are facing OOM situations after several 
hours a node has successfully joined a cluster (via auto-bootstrap).

I have created the following ticket trying to describe the situation, including 
hprof / MAT screens: 
https://issues.apache.org/jira/browse/CASSANDRA-15400

Would be great if someone could have a look.

Thanks a lot.

Thomas
The contents of this e-mail are intended for the named addressee only. It 
contains information that may be confidential. Unless you are the named 
addressee or an authorized designee, you may not copy or use it, or disclose it 
to anyone else. If you received it in error please notify us immediately and 
then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a 
company registered in Linz whose registered office is at 4040 Linz, Austria, 
Freistädterstraße 313


Re: Aws instance stop and star with ebs

2019-11-06 Thread Reid Pinchback
Almost 15 minutes, that sounds suspiciously like blocking on a default TCP 
socket timeout.
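
For anyone chasing this, the roughly-15-minute figure lines up with the kernel's retransmission limit.  A hedged sketch of the knobs involved; the value 8 is just an example:

sysctl net.ipv4.tcp_retries2              # the default of 15 retransmits works out to roughly 13-30 minutes
sudo sysctl -w net.ipv4.tcp_retries2=8    # give up on a dead peer after a couple of minutes instead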

From: Rahul Reddy 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, November 6, 2019 at 12:12 PM
To: "user@cassandra.apache.org" 
Subject: Re: Aws instance stop and star with ebs

Message from External Sender
Thank you.
I have stopped an instance in east. I see that all other instances can gossip to that instance, and only one instance in west is having issues gossiping to that node. When I enable debug mode I see the below on the west node.

I see the below messages from 16:32 to 16:47:
DEBUG [RMI TCP Connection(272)-127.0.0.1] 2019-11-06 16:44:50,
417 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from 
everybody:
424 StorageProxy.java:2361 - Hosts not in agreement. Didn't get a response from 
everybody:

Later I see a timeout:
DEBUG [MessagingService-Outgoing-/eastip-Gossip] 2019-11-06 16:47:04,831 
OutboundTcpConnection.java:350 - Error writing to /eastip
java.io.IOException: Connection timed out

then  INFO  [GossipStage:1] 2019-11-06 16:47:05,792 StorageService.j
ava:2289 - Node /eastip state jump to NORMAL

DEBUG [GossipStage:1] 2019-11-06 16:47:06,244 MigrationManager
.java:99 - Not pulling schema from /eastip, because sche
ma versions match: local/real=cdbb639b-1675-31b3-8a0d-84aca18e
86bf, local/compatible=49bf1daa-d585-38e0-a72b-b36ce82da9cb, r
emote=cdbb639b-1675-31b3-8a0d-84aca18e86bf

I tried running some tcpdump during that time and I don't see any packet loss. Still unsure why the east instance which was stopped and started was unreachable from the west node for almost 15 minutes.


On Tue, Nov 5, 2019 at 10:14 PM daemeon reiydelle 
mailto:daeme...@gmail.com>> wrote:
10 minutes is 600 seconds, and there are several timeouts that are set to that, 
including the data center timeout as I recall.

You may be forced to tcpdump the interface(s) to see where the chatter is. Out 
of curiosity, when you restart the node, have you snapped the jvm's memory to 
see if e.g. heap is even in use?


On Tue, Nov 5, 2019 at 7:03 PM Rahul Reddy 
mailto:rahulreddy1...@gmail.com>> wrote:
Thanks Ben,
Before stopping the EC2 instance I did run nodetool drain, so I ruled that out, and system.log also doesn't show commit logs being applied.




On Tue, Nov 5, 2019, 7:51 PM Ben Slater 
mailto:ben.sla...@instaclustr.com>> wrote:
The logs between first start and handshaking should give you a clue but my 
first guess would be replaying commit logs.

Cheers
Ben

---

Ben Slater
Chief Product Officer


Read our latest technical blog posts 
here.

This email has been sent on behalf of Instaclustr Pty. Limited (Australia) and 
Instaclustr Inc (USA).

This email and any attachments may contain confidential and legally privileged 
information.  If you are not the intended recipient, do not copy or disclose 
its content, but please reply to this email immediately and highlight the error 
to the sender and then immediately delete the message.


On Wed, 6 Nov 2019 at 04:36, Rahul Reddy 
mailto:rahulreddy1...@gmail.com>> wrote:
I can reproduce the issue.

I did drain the Cassandra node, then stopped and started the Cassandra instance. The instance comes up, but other nodes stay in DN state for around 10 minutes.

I don't see an error in the system log.

DN  xx.xx.xx.59   420.85 MiB  256  48.2% id  2
UN  xx.xx.xx.30   432.14 MiB  256  50.0% id  0
UN  xx.xx.xx.79   447.33 MiB  256  51.1% id  4
DN  xx.xx.xx.144  452.59 MiB  256  51.6%   

Re: Cassandra 3.0.18 went OOM several hours after joining a cluster

2019-11-06 Thread Reid Pinchback
My first thought was that you were running into the merkle tree depth problem, 
but the details on the ticket don’t seem to confirm that.

It does look like eden is too small.   C* lives in Java’s GC pain point, a lot 
of medium-lifetime objects.  If you haven’t already done so, you’ll want to 
configure as many things to be off-heap as you can, but I’d definitely look at 
improving the ratio of eden to old gen, and see if you can get the young gen GC 
activity to be more successful at sweeping away the medium-lived objects.

All that really comes to mind is if you’re getting to a point where GC isn’t 
coping.  That can be hard to sometimes spot on metrics with coarse granularity. 
 Per-second metrics might show CPU cores getting pegged.

I’m not sure that GC tuning eliminates this problem, but if it isn’t being 
caused by that, GC tuning may at least improve the visibility of the underlying 
problem.

From: "Steinmaurer, Thomas" 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, November 6, 2019 at 11:27 AM
To: "user@cassandra.apache.org" 
Subject: Cassandra 3.0.18 went OOM several hours after joining a cluster

Message from External Sender
Hello,

after moving from 2.1.18 to 3.0.18, we are facing OOM situations after several 
hours a node has successfully joined a cluster (via auto-bootstrap).

I have created the following ticket trying to describe the situation, including 
hprof / MAT screens: 
https://issues.apache.org/jira/browse/CASSANDRA-15400

Would be great if someone could have a look.

Thanks a lot.

Thomas
The contents of this e-mail are intended for the named addressee only. It 
contains information that may be confidential. Unless you are the named 
addressee or an authorized designee, you may not copy or use it, or disclose it 
to anyone else. If you received it in error please notify us immediately and 
then destroy it. Dynatrace Austria GmbH (registration number FN 91482h) is a 
company registered in Linz whose registered office is at 4040 Linz, Austria, 
Freistädterstraße 313


Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-04 Thread Reid Pinchback
It’s not a setting I’ve played with at all.  I understand the gist of it 
though, essentially it’ll let you automatically adjust your JVM size relative 
to whatever you allocated to the cgroup.  Unfortunately I’m not a K8s developer 
(that may change shortly, but atm the case).  What you need to a firm handle on 
yourself is where does the memory for the O/S file cache live, and is that size 
sufficient for your read/write activity.  Bare metal and VM tuning I understand 
better, so I’ll have to defer to others who may have specific personal 
experience with the details, but the essence of the issue should remain the 
same.  You want a file cache that functions appropriately or you’ll get 
excessive stalls happening on either reading from disk or flushing dirty pages 
to disk.


From: Ben Mills 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, November 4, 2019 at 12:14 PM
To: "user@cassandra.apache.org" 
Subject: Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

CGroup


Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

2019-11-04 Thread Reid Pinchback
Hi Ben, just catching up over the weekend.

The typical advice, per Sergio’s link reference, is an obvious starting point.  
We use G1GC and normally I’d treat 8gig as the minimal starting point for a 
heap.  What sometimes doesn’t get talked about in the myriad of tunings, is 
that you have to have a clear goal in your mind on what you are tuning *for*. 
You could be tuning for throughput, or average latency, or 99’s latency, etc.  
How you tune varies quite a lot according to your goal.  The more your goal is 
about latency, the more work you have ahead of you.

I will suggest that, if your data footprint is going to stay low, that you give 
yourself permission to do some experimentation.  As you’re using K8s, you are 
in a bit of a position where if your usage is small enough, you can get 2x bang 
for the buck on your servers by sizing the pods to about 45% of server 
resources and using the C* rack metaphor to ensure you don’t co-locate replicas.

For example, were I you, I’d start asking myself if SSTable compression 
mattered to me at all.  The reason I’d start asking myself questions like that 
is C* has multiple uses of memory, and one of the balancing acts is chunk cache 
and the O/S file cache.  If I could find a way to make my O/S file cache be a 
defacto C* cache, I’d roll up the shirt sleeves and see what kind of 
performance numbers I could squeeze out with some creative tuning experiments.  
Now, I’m not saying *do* that, because your write volume also plays a role, and 
you said you’re expecting a relatively even balance in reads and writes.  I’m 
just saying, by way of example, I’d start weighing if the advice I get online 
was based in experience similar to my current circumstance, or ones that were 
very different.
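
If that experiment ever looked worthwhile, the table-level switch itself is simple.  A sketch only, with a hypothetical table name, and no claim it is the right call for this workload:

-- leave the data uncompressed on disk so the O/S file cache holds it in directly usable form
ALTER TABLE my_ks.my_table WITH compression = {'enabled': 'false'};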

R

From: Ben Mills 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, November 4, 2019 at 8:51 AM
To: "user@cassandra.apache.org" 
Subject: Re: ***UNCHECKED*** Re: Memory Recommendations for G1GC

Message from External Sender
Hi (yet again) Sergio,

Finally, note that we use this 
sidecar
 for shipping metrics to Stackdriver. It runs as a second container within our 
Prometheus stateful set.


On Mon, Nov 4, 2019 at 8:46 AM Ben Mills 
mailto:b...@bitbrew.com>> wrote:
Hi (again) Sergio,

I forgot to note that along with Prometheus, we use Grafana (with Prometheus as 
its data source) as well as Stackdriver for monitoring.

As Stackdriver is still developing (i.e. does not have all the features we 
need), we tend to use it for the basics (i.e. monitoring and alerting on 
memory, cpu and disk (PVs) thresholds). More specifically, the Prometheus JMX 
exporter (noted above) scrapes all the MBeans inside Cassandra, exporting in 
the Prometheus data model. Its config map filters (allows) our metrics of 
interest, and those metrics are sent to our Grafana instances and to 
Stackdriver. We use Grafana for more advanced metric configs that provide 
deeper insight in Cassandra - e.g. read/write latencies and so forth. For 
monitoring memory utilization, we monitor both pod-level in Stackdriver (i.e. 
to avoid having a Cassandra pod oomkilled by kubelet) as well as inside the JVM 
(heap space).

Hope this helps.

On Mon, Nov 4, 2019 at 8:26 AM Ben Mills 
mailto:b...@bitbrew.com>> wrote:
Hi Sergio,

Thanks for this and sorry for the slow reply.

We are indeed still running Java 8 and so it's very helpful.

This Cassandra cluster has been running reliably in Kubernetes for several 
years, and while we've had some repair-related issues, they are not related to 
container orchestration or the cloud environment. We don't use operators and 
have simply built the needed Kubernetes configs (YAML manifests) to handle 
deployment of new Docker images (when needed), and so forth. We have:

(1) ConfigMap - Cassandra environment variables
(2) ConfigMap - Prometheus configs for this JMX 
exporter,
 which is built into the image and runs as a Java agent
(3) PodDisruptionBudget - with minAvailable: 2 as the important setting
(4) Service - this is a headless service (clusterIP: None) which specifies the 
ports for cql, jmx, prometheus, intra-node
(5) StatefulSet - 3 replicas, ports, health checks, resources, etc - as you 
would expect

We store data on persistent volumes using an SSD storage class, and use: an 
updateStrategy of OnDelete, some affinity rules to ensure an even spread of 
pods across our zones, Prometheus annotations for 

Re: Memory Recommendations for G1GC

2019-11-01 Thread Reid Pinchback
Maybe I’m missing something.  You’re expecting less than 1 gig of data per 
node?  Unless this is some situation of super-high data churn/brief TTL, it 
sounds like you’ll end up with your entire database in memory.

From: Ben Mills 
Reply-To: "user@cassandra.apache.org" 
Date: Friday, November 1, 2019 at 3:31 PM
To: "user@cassandra.apache.org" 
Subject: Memory Recommendations for G1GC

Message from External Sender
Greetings,

We are planning a Cassandra upgrade from 3.7 to 3.11.5 and considering a change 
to the GC config.

What is the minimum amount of memory that needs to be allocated to heap space 
when using G1GC?

For GC, we currently use CMS. Along with the version upgrade, we'll be running 
the stateful set of Cassandra pods on new machine types in a new node pool with 
12Gi memory per node. Not a lot of memory but an improvement. We may be able to 
go up to 16Gi memory per node. We'd like to continue using these heap settings:

-XX:+UnlockExperimentalVMOptions
-XX:+UseCGroupMemoryLimitForHeap
-XX:MaxRAMFraction=2

which (if 12Gi per node) would provide 6Gi memory for heap (i.e. half of total 
available).

Here are some details on the environment and configs in the event that 
something is relevant.

Environment: Kubernetes
Environment Config: Stateful set of 3 replicas
Storage: Persistent Volumes
Storage Class: SSD
Node OS: Container-Optimized OS
Container OS: Ubuntu 16.04.3 LTS
Data Centers: 1
Racks: 3 (one per zone)
Nodes: 3
Tokens: 4
Replication Factor: 3
Replication Strategy: NetworkTopologyStrategy (all keyspaces)
Compaction Strategy: STCS (all tables)
Read/Write Requirements: Blend of both
Data Load: <1GB per node
gc_grace_seconds: default (10 days - all tables)

GC Settings: (CMS)

-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:CMSWaitDuration=3
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
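
For comparison, a minimal G1 sketch of the sort being weighed here; the usual rule of thumb is that G1 prefers 8G+ of heap, so treat these flags as assumptions to test at 6Gi rather than settled advice:

-XX:+UseG1GC
-XX:MaxGCPauseMillis=200
-XX:InitiatingHeapOccupancyPercent=70
-XX:G1RSetUpdatingPauseTimePercent=5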

Any ideas are much appreciated.


Re: Cassandra 4 alpha/alpha2

2019-11-01 Thread Reid Pinchback
That is indeed what Amazon AMIs are for.  

However if your question is “why don’t the C* developers do that for people?” 
the answer is going to be some mix of “people only do so much work for free” 
and “the ones that don’t do it for free have a company you pay to do things 
like that (Datastax)”.  Keep in mind, that when you create AMIs you’re using 
AWS resources and whoever owns the account that did the work, is on the hook to 
pay for the resources.

But if your question is about whether you can do that for your own company, 
then obviously yes.  And when you do so at first it’ll be about C*, then it’ll 
be about how your company in particular likes to monitor things, and handle 
backup, spec out encryption of data at rest, and deal with auth security, and 
deal with log shipping, and deal with PII concerns, and …

Which is why there isn’t really a big win to other people setting up an AMI for 
you, except in cases where they are offering whatever-it-is-as-a-service and 
get paid for its usage.  1000 consumers will say they want a simple thing, but 
all 1000 usages will be a little different, and nobody will like the AMI they 
get if their simple thing isn’t present on it.

(plus AMI creation and maintenance, within and across regions, is just a pain 
in the rump and I can’t imagine doing it without money coming back from the 
effort)


From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, October 31, 2019 at 4:09 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra 4 alpha/alpha2

Message from External Sender
OOO but still relevant:
Would not it be possible to create an Amazon AMI that has all the OS and JVM 
settings in the right place and from there each developer can tweak the things 
that need to be adjusted?
Best,
Sergio

Il giorno gio 31 ott 2019 alle ore 12:56 Abdul Patel 
mailto:abd786...@gmail.com>> ha scritto:
Looks like I am messing up or missing something... will revisit again.

On Thursday, October 31, 2019, Stefan Miklosovic 
mailto:stefan.mikloso...@instaclustr.com>> 
wrote:
Hi,

I have tested both alpha and alpha2 and 3.11.5 on Centos 7.7.1908 and
all went fine (I have some custom images for my own purposes).

The update between alpha and alpha2 was just a mere version bump.

Cheers

On Thu, 31 Oct 2019 at 20:40, Abdul Patel 
mailto:abd786...@gmail.com>> wrote:
>
> Hey Everyone
>
> Was anyone successful in installing either the alpha or alpha2 version of
> Cassandra 4.0?
> Found 2 issues:
> 1> cassandra-env.sh:
> The JAVA_VERSION variable is not defined.
> The jvm-server.options file is not defined.
>
> This is fixable, and after adding those, the cassandra-env.sh errors went away.
>
> 2> The second and major issue: the cassandra binary, when I try to start it,
> says syntax error.
>
> /bin/cassandra: line 198: exec: : not found.
>
> Does anyone have any idea on the second issue?
>

-
To unsubscribe, e-mail: 
user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: 
user-h...@cassandra.apache.org


Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-11-01 Thread Reid Pinchback
Hi Sergio,

I’m definitely not enough of a network wonk to make definitive statements on 
network configuration, finding your in-company network expert is definitely 
going to be a lot more productive.  I’ve forgotten if you are on-prem or in 
AWS, so if in AWS replace “your network wonk” with “your AWS support contact” 
if you’re paying for support.  I will make two more concrete observations 
though, and you can run these notions down as appropriate.

When C* starts up, see if the logs contain a warning about jemalloc not being 
detected.  That’s something we missed in our 3.11.4 setup and is on my todo 
list to circle back around to evaluate later.  JVMs have some rather 
complicated memory management that relates to efficient allocation of memory to 
threads (this isn’t strictly a JVM thing, but JVMs definitely care).  If you 
have high connection counts, I can see that likely mattering to you.  Also, as 
part of that, the memory arena setting of 4 that is Cassandra’s default may not 
be the right one for you.  The more concurrency you have, the more that number 
may need to bump up to avoid contention on memory allocations.  We haven’t 
played with it because our simultaneous connection counts are modest.  Note 
that Cassandra can create a lot of threads but many of them have low activity 
so I think it’s more about how many area actually active.  Large connection 
counts will move the needle up on you and may motivate tuning the arena count.

When talking to your network person, I’d see what they think about C*’s 
defaults on TCP_NODELAY vs delayed ACKs.  The Datastax docs say that the 
TCP_NODELAY default setting is false in C*, but I looked in the 3.11.4 source 
and the default is coded as true.  It’s only via the config file samples that 
bounce around that it typically gets set to false.  There are times where Nagle 
and delayed ACKs don’t play well together and induce stalls.  I’m not the 
person to help you investigate that because it gets a bit gnarly on the details 
(for example, a refinement to the Nagle algorithm was proposed in the 1990’s 
that exists in some OS’s and can make my comments here moot).  Somebody who 
lives this stuff will be a more definitive source, but you are welcome to 
copy-paste my thoughts to them for context.

R

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, October 30, 2019 at 5:56 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra 3.11.4 Node the load starts to increase after few 
minutes to 40 on 4 CPU machine

Message from External Sender
Hi Reid,

I don't have this loading problem anymore.
I solved it by changing the Cassandra driver configuration.
Now my cluster is pretty stable and I don't have machines with a crazy CPU load.
The only thing, not urgent but something I need to investigate, is the number of ESTABLISHED TCP connections. I see just one node having 7K TCP connections ESTABLISHED while the others have around 4-6K connections open. So the newest nodes added to the cluster have a higher number of ESTABLISHED TCP connections.

default['cassandra']['sysctl'] = {
'net.ipv4.tcp_keepalive_time' => 60,
'net.ipv4.tcp_keepalive_probes' => 3,
'net.ipv4.tcp_keepalive_intvl' => 10,
'net.core.rmem_max' => 16777216,
'net.core.wmem_max' => 16777216,
'net.core.rmem_default' => 16777216,
'net.core.wmem_default' => 16777216,
'net.core.optmem_max' => 40960,
'net.ipv4.tcp_rmem' => '4096 87380 16777216',
'net.ipv4.tcp_wmem' => '4096 65536 16777216',
'net.ipv4.ip_local_port_range' => '1 65535',
'net.ipv4.tcp_window_scaling' => 1,
  'net.core.netdev_max_backlog' => 2500,
  'net.core.somaxconn' => 65000,
'vm.max_map_count' => 1048575,
'vm.swappiness' => 0
}

These are my tweaked values; I used the values recommended by Datastax.

Do you have something different?

Best,
Sergio

Il giorno mer 30 ott 2019 alle ore 13:27 Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> ha scritto:
Oh nvm, didn't see the later msg about just posting what your fix was.

R


On 10/30/19, 4:24 PM, "Reid Pinchback" 
mailto:rpinchb...@tripadvisor.com>> wrote:

 Message from External Sender

Hi Sergio,

Assuming nobody is actually mounting a SYN flood attack, then this sounds 
like you're either being hammered with connection requests in very short 
periods of time, or your TCP backlog tuning is off.   At least, that's where 
I'd start looking.  If you take that log message and google it (Possible SYN 
flooding... Sending cookies") you'll find explanations.  Or just googling "TCP 
backlog tuning".

R


On 10/30/19, 3:29 PM, "Sergio Bilello" 
mailto:lapostadiser...@gmail.com>> wrote:

>
>Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: 
TCP: request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. 
Check SNMP counters.




--

Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Reid Pinchback
Oh nvm, didn't see the later msg about just posting what your fix was.

R


On 10/30/19, 4:24 PM, "Reid Pinchback"  wrote:

 Message from External Sender

Hi Sergio,

Assuming nobody is actually mounting a SYN flood attack, then this sounds 
like you're either being hammered with connection requests in very short 
periods of time, or your TCP backlog tuning is off.   At least, that's where 
I'd start looking.  If you take that log message and google it (Possible SYN 
flooding... Sending cookies") you'll find explanations.  Or just googling "TCP 
backlog tuning".

R


On 10/30/19, 3:29 PM, "Sergio Bilello"  wrote:

>
>Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: 
TCP: request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. 
Check SNMP counters.




-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org




Re: Cassandra 3.11.4 Node the load starts to increase after few minutes to 40 on 4 CPU machine

2019-10-30 Thread Reid Pinchback
Hi Sergio,

Assuming nobody is actually mounting a SYN flood attack, then this sounds like 
you're either being hammered with connection requests in very short periods of 
time, or your TCP backlog tuning is off.   At least, that's where I'd start 
looking.  If you take that log message and google it (Possible SYN flooding... 
Sending cookies") you'll find explanations.  Or just googling "TCP backlog 
tuning".

R


On 10/30/19, 3:29 PM, "Sergio Bilello"  wrote:

>
>Oct 17 00:23:03 prod-personalization-live-data-cassandra-08 kernel: TCP: 
request_sock_TCP: Possible SYN flooding on port 9042. Sending cookies. Check 
SNMP counters.




-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


Re: Where to get old RPMs?

2019-10-30 Thread Reid Pinchback
Oh, my mistake, there was also another subdirectory there with the old rpm’s, I 
missed that the first time.  Thanks.


From: Reid Pinchback 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, October 30, 2019 at 1:47 PM
To: "user@cassandra.apache.org" 
Subject: Re: Where to get old RPMs?

Message from External Sender
Alas, that wasn’t the info I was looking for Jon, the archive site you pointed 
to is for the jars, not the rpms.  The rpm site you pointed at only has the 
current, not the past point releases.  Michael had the magic link though, I’m 
all set.  

R

From: Jon Haddad 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, October 30, 2019 at 1:46 PM
To: "user@cassandra.apache.org" 
Subject: Re: Where to get old RPMs?

Message from External Sender
Archives are here: 
http://archive.apache.org/dist/cassandra/

For example, the RPM for 3.11.x you can find here: 
http://archive.apache.org/dist/cassandra/redhat/311x/

The old releases are removed by Apache automatically as part of their policy, 
it's not specific to Cassandra.


On Wed, Oct 30, 2019 at 10:39 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
With the latest round of C* updates, the yum repo no longer has whatever the 
previous version is.  For environments that try to do more controlled stepping 
of release changes instead of just taking the latest, is there any URL for 
previous versions of RPMs?  Previous jars I can find easily enough, but not 
RPMs.



Re: Where to get old RPMs?

2019-10-30 Thread Reid Pinchback
Alas, that wasn’t the info I was looking for Jon, the archive site you pointed 
to is for the jars, not the rpms.  The rpm site you pointed at only has the 
current, not the past point releases.  Michael had the magic link though, I’m 
all set.  

R

From: Jon Haddad 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, October 30, 2019 at 1:46 PM
To: "user@cassandra.apache.org" 
Subject: Re: Where to get old RPMs?

Message from External Sender
Archives are here: 
http://archive.apache.org/dist/cassandra/

For example, the RPM for 3.11.x you can find here: 
http://archive.apache.org/dist/cassandra/redhat/311x/

The old releases are removed by Apache automatically as part of their policy, 
it's not specific to Cassandra.


On Wed, Oct 30, 2019 at 10:39 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
With the latest round of C* updates, the yum repo no longer has whatever the 
previous version is.  For environments that try to do more controlled stepping 
of release changes instead of just taking the latest, is there any URL for 
previous versions of RPMs?  Previous jars I can find easily enough, but not 
RPMs.



Re: Where to get old RPMs?

2019-10-30 Thread Reid Pinchback
Thanks Michael, that was exactly the info I needed.


On 10/30/19, 1:44 PM, "Michael Shuler"  wrote:

 Message from External Sender

On 10/30/19 12:39 PM, Reid Pinchback wrote:

> With the latest round of C* updates, the yum repo no longer has

> whatever the previous version is.  For environments that try to do

> more controlled stepping of release changes instead of just taking

> the latest, is there any URL for previous versions of RPMs?  Previous

> jars I can find easily enough, but not RPMs.



All the old release artifacts are archived at archive.apache.org. The 

non-latest RPMs are under the redhat/XYx/ directory for whichever major 

version you need.




https://archive.apache.org/dist/cassandra/redhat/



Michael



-

To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org

For additional commands, e-mail: user-h...@cassandra.apache.org






-
To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
For additional commands, e-mail: user-h...@cassandra.apache.org


Where to get old RPMs?

2019-10-30 Thread Reid Pinchback
With the latest round of C* updates, the yum repo no longer has whatever the 
previous version is.  For environments that try to do more controlled stepping 
of release changes instead of just taking the latest, is there any URL for 
previous versions of RPMs?  Previous jars I can find easily enough, but not 
RPMs.



Re: Repair Issues

2019-10-24 Thread Reid Pinchback
Ben, you may find this helpful:

https://blog.pythian.com/so-you-have-a-broken-cassandra-sstable-file/


From: Ben Mills 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, October 24, 2019 at 3:31 PM
To: "user@cassandra.apache.org" 
Subject: Repair Issues

Message from External Sender
Greetings,

Inherited a small Cassandra cluster with some repair issues and need some 
advice on recommended next steps. Apologies in advance for a long email.

Issue:

Intermittent repair failures on two non-system keyspaces.

- platform_users
- platform_management

Repair Type:

Full, parallel repairs are run on each of the three nodes every five days.

Repair command output for a typical failure:

[2019-10-18 00:22:09,109] Starting repair command #46, repairing keyspace 
platform_users with repair options (parallelism: parallel, primary range: 
false, incremental: false, job threads: 1, ColumnFamilies: [], dataCenters: [], 
hosts: [], # of ranges: 12)
[2019-10-18 00:22:09,242] Repair session 5282be70-f13d-11e9-9b4e-7f6db768ba9a 
for range [(-1890954128429545684,2847510199483651721], 
(8249813014782655320,-8746483007209345011], 
(4299912178579297893,6811748355903297393], 
(-8746483007209345011,-8628999431140554276], 
(-5865769407232506956,-4746990901966533744], 
(-4470950459111056725,-1890954128429545684], 
(4001531392883953257,4299912178579297893], 
(6811748355903297393,6878104809564599690], 
(6878104809564599690,8249813014782655320], 
(-4746990901966533744,-4470950459111056725], 
(-8628999431140554276,-5865769407232506956], 
(2847510199483651721,4001531392883953257]] failed with error [repair 
#5282be70-f13d-11e9-9b4e-7f6db768ba9a on platform_users/access_tokens_v2, 
[(-1890954128429545684,2847510199483651721], 
(8249813014782655320,-8746483007209345011], 
(4299912178579297893,6811748355903297393], 
(-8746483007209345011,-8628999431140554276], 
(-5865769407232506956,-4746990901966533744], 
(-4470950459111056725,-1890954128429545684], 
(4001531392883953257,4299912178579297893], 
(6811748355903297393,6878104809564599690], 
(6878104809564599690,8249813014782655320], 
(-4746990901966533744,-4470950459111056725], 
(-8628999431140554276,-5865769407232506956], 
(2847510199483651721,4001531392883953257]]] Validation failed in /10.x.x.x 
(progress: 26%)
[2019-10-18 00:22:09,246] Some repair failed
[2019-10-18 00:22:09,248] Repair command #46 finished in 0 seconds

Additional Notes:

Repairs encounter above failures more often than not. Sometimes on one node 
only, though occasionally on two. Sometimes just one of the two keyspaces, 
sometimes both. Apparently the previous repair schedule for this cluster 
included incremental repairs (script alternated between incremental and full 
repairs). After reading this TLP article:

https://thelastpickle.com/blog/2017/12/14/should-you-use-incremental-repair.html

the repair script was replaced with cassandra-reaper (v1.4.0), which was run 
with its default configs. Reaper was fine but only obscured the ongoing issues 
(it did not resolve them) and complicated the debugging process and so was then 
removed. The current repair schedule is as described above under Repair Type.

Attempts at Resolution:

(1) nodetool scrub was attempted on the offending keyspaces/tables to no effect.

(2) sstablescrub has not been attempted due to the current design of the Docker 
image that runs Cassandra in each Kubernetes pod - i.e. there is no way to stop 
the server to run this utility without killing the only pid running in the 
container.

Related Error:

Not sure if this is related, though sometimes, when either:

(a) Running nodetool snapshot, or
(b) Rolling a pod that runs a Cassandra node, which calls nodetool drain prior to shutdown,

the following error is thrown:

-- StackTrace --
java.lang.RuntimeException: Last written key 
DecoratedKey(10df3ba1-6eb2-4c8e-bddd-c0c7af586bda, 
10df3ba16eb24c8ebdddc0c7af586bda) >= current key 
DecoratedKey(----, 
17343121887f480c9ba87c0e32206b74) writing into 
/cassandra_data/data/platform_management/device_by_tenant_v2-e91529202ccf11e7ab96d5693708c583/.device_by_tenant_tags_idx/mb-45-big-Data.db
at 
org.apache.cassandra.io.sstable.format.big.BigTableWriter.beforeAppend(BigTableWriter.java:114)
at 
org.apache.cassandra.io.sstable.format.big.BigTableWriter.append(BigTableWriter.java:153)
at 
org.apache.cassandra.io.sstable.SimpleSSTableMultiWriter.append(SimpleSSTableMultiWriter.java:48)
at 
org.apache.cassandra.db.Memtable$FlushRunnable.writeSortedContents(Memtable.java:441)
at 
org.apache.cassandra.db.Memtable$FlushRunnable.call(Memtable.java:477)
  

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-24 Thread Reid Pinchback
Two different AWS AZs are in two different physical locations.  Typically 
different cities.  Which means that you’re trying to manage the risk of an AZ 
going dark, so you use more than one AZ just in case.  The downside is that you 
will have some degree of network performance difference between AZs because of 
whatever WAN pipe AWS owns/leased to connect between them.

Having a DC in one AZ is easy to reason about.  The AZ is there, or it is not.  
If you have two DCs in your cluster, and you lose an AZ, it means you still 
have a functioning cluster with one DC and you still have quorum.  Yay, even in 
an outage, you know you can still do business.  You would only have to route 
any traffic normally sent to the other DC to the remaining one, so as long as 
there is resource headroom planning in how you provision your hardware, you’re 
in a safe state.

If you start splitting a DC across AZs without using racks to organize nodes on 
a per-AZ basis, off the top of my head I don’t know how you reason about your 
risks for losing quorum without pausing to really think through vnodes and 
token distribution and whatnot.  I’m not a fan of topologies I can’t reason 
about when paged at 3 in the morning and I’m half asleep.  I prefer simple 
until the workload motivates complex.

R


From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Thursday, October 24, 2019 at 12:06 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra Rack - Datacenter Load Balancing relations

Message from External Sender
Thanks Reid and Jon!

Yes I will stick with one rack per DC for sure and I will look at the Vnodes 
problem later on.


What's the difference in terms of reliability between
A) spreading 2 Datacenters across 3 AZ
B) having 2 Datacenters in 2 separate AZ
?


Best,

Sergio

On Thu, Oct 24, 2019, 7:36 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Hey Sergio,

Forgive me, but I’m at work and had to skim the info quickly.

When in doubt, simplify.  So 1 rack per DC.  Distributed systems get rapidly 
harder to reason about the more complicated you make them.  There’s more than 
enough to learn about C* without jumping into the complexity too soon.

To deal with the unbalancing issue, pay attention to Jon Haddad’s advice on 
vnode count and how to fairly distribute tokens with a small vnode count.  I’d 
rather point you to his information, as I haven’t dug into vnode counts and 
token distribution in detail; he’s got a lot more time in C* than I do.  I come 
at this more as a traditional RDBMS and Java guy who has slowly gotten up to 
speed on C* over the last few years, and dealt with DynamoDB a lot so have 
lived with a lot of similarity in data modelling concerns.  Detailed internals 
I only know in cases where I had reason to dig into C* source.
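
For context, the knobs behind that advice live in cassandra.yaml and only take effect when a node first bootstraps, so they are really for a fresh DC.  A sketch, with the keyspace name being an assumption:

# cassandra.yaml
num_tokens: 4
allocate_tokens_for_keyspace: my_keyspace   # lets the allocator balance the few tokens against this keyspace's RF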

There are so many knobs to turn in C* that it can be very easy to overthink 
things.  Simplify where you can.  Remove GC pressure wherever you can.  
Negotiate with your consumers to have data models that make sense for C*.  If 
you have those three criteria foremost in mind, you’ll likely be fine for quite 
some time.  And in the times where something isn’t going well, simpler is 
easier to investigate.

R

From: Sergio mailto:lapostadiser...@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Date: Wednesday, October 23, 2019 at 3:34 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: Re: Cassandra Rack - Datacenter Load Balancing relations

Message from External Sender
Hi Reid,

Thank you very much for clearing these concepts for me.
https://community.datastax.com/comments/1133/view.html
I posted this question on the DataStax forum regarding our cluster being unbalanced, and the reply was that the number of racks should be a multiple of the replication factor (or 1) in order to be balanced. I thought then that if I have 3 availability zones I should have 3 racks for each datacenter and not 2 (us-east-1b, us-east-1a) as I have right now, or, in the easiest way, I should have one rack per datacenter.



1.  Datacenter: live

Status=Up/Down
|/ State=Normal/Leaving/Joining/Moving
--  Address  Load   Tokens   OwnsHost ID
   Rack
UN  10.1.20.49   289.75 GiB  256  ?   
be5a0193-56e7-4d42-8cc8-5d2141ab4872  us-east-1a
UN  10.1.30.112  103.03 GiB  256  ?   
e5108a8e-cc2f-4914-a86e-fccf770e3f0f  us-east-1b
UN  10.1.19.163  129.61 GiB  256  ?   

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-24 Thread Reid Pinchback
…replicate the data in the other datacenters. My scope is to keep the read machines dedicated to serving reads and the write machines to serving writes. Cassandra will handle the replication for me. Is there any other option that I am missing, or a wrong assumption? I am thinking that I will write a blog post about all my learnings so far. Thank you very much for the replies. Best, Sergio

Il giorno mer 23 ott 2019 alle ore 10:57 Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> ha scritto:
No, that’s not correct.  The point of racks is to help you distribute the 
replicas, not further-replicate the replicas.  Data centers are what do the 
latter.  So for example, if you wanted to be able to ensure that you always had 
quorum if an AZ went down, then you could have two DCs where one was in each 
AZ, and use one rack in each DC.  In your situation I think I’d be more tempted 
to consider that.  Then if an AZ went away, you could fail over your traffic to 
the remaining DC and still be perfectly fine.

For background on replicas vs racks, I believe the information you want is 
under the heading ‘NetworkTopologyStrategy’ at:
http://cassandra.apache.org/doc/latest/architecture/dynamo.html

That should help you better understand how replicas distribute.

As mentioned before, while you can choose to do the reads in one DC, except for 
concerns about contention related to network traffic and connection handling, 
you can’t isolate reads from writes.  You can _mostly_ insulate the write DC 
from the activity within the read DC, and even that isn’t an absolute because 
of repairs.  However, your mileage may vary, so do what makes sense for your 
usage pattern.

R

From: Sergio mailto:lapostadiser...@gmail.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Date: Wednesday, October 23, 2019 at 12:50 PM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: Re: Cassandra Rack - Datacenter Load Balancing relations

Message from External Sender
Hi Reid,

Thanks for your reply. I really appreciate your explanation.

We are in AWS and right now we are using 2 Availability Zones, not 3. We found our cluster really unbalanced because the keyspace has a replication factor = 3 and the number of racks is 2, with 2 datacenters.
We want the writes spread across all the nodes, but we wanted the reads isolated from the writes to keep the load on those nodes low and to be able to identify problems in the consumer (read) or producer (write) applications.
It looks like each rack contains an entire copy of the data, so this would lead to replicating the information for each rack and then for each node. If I am correct, if we have a keyspace with 100GB and replication factor = 3 and racks = 3 => 100 * 3 * 3 = 900GB.
If I had only one rack across 2 or even 3 availability zones I would save space and would have 300GB only. Please correct me if I am wrong.

Best,

Sergio

Il giorno mer 23 ott 2019 alle ore 09:21 Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> ha scritto:
Datacenters and racks are different concepts.  While they don't have to be 
associated with their historical meanings, the historical meanings probably 
provide a helpful model for understanding what you want from them.

When companies own their own physical servers and have them housed somewhere, 
the questions arise on where you want to locate any particular server.  It's a 
balancing act on things like network speed of related servers being able to 
talk to each other, versus fault-tolerance of having many servers not all 
exposed to the same risks.

"Same rack" in that physical world tended to mean something like "all behind 
the same network switch and all sharing the same power bus".  The morning after 
an electrical glitch fries a power bus and thus everything in that rack, you 
realize you wished you didn't have so many of the same type of server together. 
 Well, they were servers.  Now they are door stops.  Badness and sadness.

That's kind of the mindset to have in mind with racks in Cassandra.  It's an 
artifact for you to separate servers into pools so that the disparate pools 
have hopefully somewhat independent infrastructure risks.  However, all those 
servers are still doing the same kind of work, are the same version, etc.

Datacenters are amalgams of those racks, and how similar or different they are 
from each other depends on what you want to do with them.  What is true is that 
if you have N datacenters, each one of them must have enough disk stora

Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Reid Pinchback
No, that’s not correct.  The point of racks is to help you distribute the 
replicas, not further-replicate the replicas.  Data centers are what do the 
latter.  So for example, if you wanted to be able to ensure that you always had 
quorum if an AZ went down, then you could have two DCs where one was in each 
AZ, and use one rack in each DC.  In your situation I think I’d be more tempted 
to consider that.  Then if an AZ went away, you could fail over your traffic to 
the remaining DC and still be perfectly fine.

For background on replicas vs racks, I believe the information you want is 
under the heading ‘NetworkTopologyStrategy’ at:

http://cassandra.apache.org/doc/latest/architecture/dynamo.html

That should help you better understand how replicas distribute.
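
To make the replica math concrete, a sketch using an assumed second DC name alongside the 'live' DC mentioned in this thread; the counts are replicas per datacenter, and racks only influence where those replicas land:

CREATE KEYSPACE my_ks
  WITH replication = {'class': 'NetworkTopologyStrategy', 'live': 3, 'read': 3};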

As mentioned before, while you can choose to do the reads in one DC, except for 
concerns about contention related to network traffic and connection handling, 
you can’t isolate reads from writes.  You can _mostly_ insulate the write DC 
from the activity within the read DC, and even that isn’t an absolute because 
of repairs.  However, your mileage may vary, so do what makes sense for your 
usage pattern.

R

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, October 23, 2019 at 12:50 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra Rack - Datacenter Load Balancing relations

Message from External Sender
Hi Reid,

Thanks for your reply. I really appreciate your explanation.

We are in AWS and right now we are using 2 Availability Zones, not 3. We found our cluster really unbalanced because the keyspace has a replication factor = 3 and the number of racks is 2, with 2 datacenters.
We want the writes spread across all the nodes, but we wanted the reads isolated from the writes to keep the load on those nodes low and to be able to identify problems in the consumer (read) or producer (write) applications.
It looks like each rack contains an entire copy of the data, so this would lead to replicating the information for each rack and then for each node. If I am correct, if we have a keyspace with 100GB and replication factor = 3 and racks = 3 => 100 * 3 * 3 = 900GB.
If I had only one rack across 2 or even 3 availability zones I would save space and would have 300GB only. Please correct me if I am wrong.

Best,

Sergio


On Wed, Oct 23, 2019 at 09:21 Reid Pinchback  wrote:
Datacenters and racks are different concepts.  While they don't have to be 
associated with their historical meanings, the historical meanings probably 
provide a helpful model for understanding what you want from them.

When companies own their own physical servers and have them housed somewhere, 
the questions arise on where you want to locate any particular server.  It's a 
balancing act on things like network speed of related servers being able to 
talk to each other, versus fault-tolerance of having many servers not all 
exposed to the same risks.

"Same rack" in that physical world tended to mean something like "all behind 
the same network switch and all sharing the same power bus".  The morning after 
an electrical glitch fries a power bus and thus everything in that rack, you 
realize you wished you didn't have so many of the same type of server together. 
 Well, they were servers.  Now they are door stops.  Badness and sadness.

That's kind of the mindset to have in mind with racks in Cassandra.  It's an 
artifact for you to separate servers into pools so that the disparate pools 
have hopefully somewhat independent infrastructure risks.  However, all those 
servers are still doing the same kind of work, are the same version, etc.

Datacenters are amalgams of those racks, and how similar or different they are 
from each other depends on what you want to do with them.  What is true is that 
if you have N datacenters, each one of them must have enough disk storage to 
house all the data.  The actual physical footprint of that data in each DC 
depends on the replication factors in play.

Note that you sorta can't have "one datacenter for writes" because the writes 
will replicate across the data centers.  You could definitely choose to have 
only one that takes read queries, but best to think of writing as being 
universal.  One scenario you can have is where the DC not taking live traffic 
read queries is the one you use for maintenance or performance testing or 
version upgrades.

One rack makes your life easier if you don't have a reason for multiple racks. 
It depends on the environment you deploy into and your fault tolerance goals.  
If you were in AWS and wanting to spread risk across availability zones, then 
you would likely have as many racks as AZs you choose to be in, because that's 
really the point of using multiple AZs.

R


On 10/23/19, 4:06 AM, "Sergio Bilello"  wrote:

Re: merge two cluster

2019-10-23 Thread Reid Pinchback
I haven’t seen much evidence that larger cluster = more performance, plus or 
minus the statistics of speculative retry.  It horizontally scales for storage 
definitely, and somewhat for connection volume.  If anything, per Sean’s 
observation, you have less ability to have a stable tuning for a particular 
usage pattern.

Try to have a mental picture of what you think is happening in the JVM while 
Cassandra is running.  There are short-lived objects, medium-lived objects, 
long/static-lived objects, and behind the scenes some degree of read I/O and 
write I/O against disk.  Garbage collectors struggle badly with medium-lived 
objects, but Cassandra really depends a great deal on those.  If you merge two 
clusters together, within any one node you still have the JVM size and disk 
architecture you had before, but you are adding competition on fixed resources 
and potentially in the very way they find most difficult to handle.

If those resources were heavily underutilized, like Sean’s point about merging 
small apps together, then sure.  But if those two clusters of yours are already 
showing that they experience significant load, then you are unlikely to improve 
anything, far more likely to end up worse off.  GC overhead and compaction 
flushes to disk are your challenges; merging two clusters doesn’t change the 
physics of those two areas, but could increase the demand on them.
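If you want a quick read on how much headroom the two clusters actually have before deciding, a few plain nodetool spot checks (nothing exotic, run per node):

  nodetool gcstats          # GC time accumulated since the last call to gcstats
  nodetool compactionstats  # pending and active compactions
  nodetool tpstats          # pending/blocked/dropped tasks per thread pool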

The only caveat to all of the above I can think of is if there was a 
fault-tolerance story motivating the merging.  Like “management wants us in two 
AZs in AWS, but lacks the budget for more instances, and each pool by itself is 
too small for us to come up with a 2 rack organization that makes sense”.

R

From: Osman YOZGATLIOĞLU 
Reply-To: "user@cassandra.apache.org" 
Date: Wednesday, October 23, 2019 at 10:40 AM
To: "user@cassandra.apache.org" 
Subject: Re: merge two cluster

Message from External Sender

Sorry, missing question;

Actually I'm asking this from a performance perspective. At the application level both clusters are used at the same time and at approximately the same level. Data is inserted into both clusters, different parts of course.

If I merge the two clusters, can I gain some performance improvement? Like RAID stripes: more disks, more stripes, more speed..



Regards
On 23.10.2019 17:30, Durity, Sean R wrote:
Beneficial to whom? The apps, the admins, the developers?

I suggest that app teams have separate clusters per application. This prevents 
the noisy neighbor problem, isolates any security issues, and helps when it is 
time for maintenance, upgrade, performance testing, etc. to not have to 
coordinate multiple app teams at the same time. Also, an individual cluster can 
be tuned for its specific workload. Sometimes, though, costs and data size push 
us towards combining smaller apps owned by the same team onto a single cluster. 
Those are the exceptions.

As a Cassandra admin, I am always trying to scale the ability to admin multiple 
clusters without just adding new admins. That is an on-going task, dependent on 
your operating environment.

Also, because every table has a portion of memory (memtable), there is a 
practical limit to the number of tables that any one cluster should have. I 
have heard it is in the low hundreds of tables. This puts a limit on the number 
of applications that a cluster can safely support.
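A quick way to see how close a cluster is to that ballpark (on 3.x; the count includes the system keyspaces, so mentally subtract those):

  cqlsh -e "SELECT COUNT(*) FROM system_schema.tables;"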


Sean Durity – Staff Systems Engineer, Cassandra

From: Osman YOZGATLIOĞLU 

Sent: Wednesday, October 23, 2019 6:23 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] merge two cluster


Hello,

I have two clusters and both contain different data sets, with different node counts.

Would it be beneficial to merge the two clusters?



Regards,

Osman





Re: Cassandra Rack - Datacenter Load Balancing relations

2019-10-23 Thread Reid Pinchback
Datacenters and racks are different concepts.  While they don't have to be 
associated with their historical meanings, the historical meanings probably 
provide a helpful model for understanding what you want from them.

When companies own their own physical servers and have them housed somewhere, 
the questions arise on where you want to locate any particular server.  It's a 
balancing act on things like network speed of related servers being able to 
talk to each other, versus fault-tolerance of having many servers not all 
exposed to the same risks.  

"Same rack" in that physical world tended to mean something like "all behind 
the same network switch and all sharing the same power bus".  The morning after 
an electrical glitch fries a power bus and thus everything in that rack, you 
realize you wished you didn't have so many of the same type of server together. 
 Well, they were servers.  Now they are door stops.  Badness and sadness.  

That's kind of the mindset to have in mind with racks in Cassandra.  It's an 
artifact for you to separate servers into pools so that the disparate pools 
have hopefully somewhat independent infrastructure risks.  However, all those 
servers are still doing the same kind of work, are the same version, etc.

Datacenters are amalgams of those racks, and how similar or different they are 
from each other depends on what you want to do with them.  What is true is that 
if you have N datacenters, each one of them must have enough disk storage to 
house all the data.  The actual physical footprint of that data in each DC 
depends on the replication factors in play.

Note that you sorta can't have "one datacenter for writes" because the writes 
will replicate across the data centers.  You could definitely choose to have 
only one that takes read queries, but best to think of writing as being 
universal.  One scenario you can have is where the DC not taking live traffic 
read queries is the one you use for maintenance or performance testing or 
version upgrades.

One rack makes your life easier if you don't have a reason for multiple racks. 
It depends on the environment you deploy into and your fault tolerance goals.  
If you were in AWS and wanting to spread risk across availability zones, then 
you would likely have as many racks as AZs you choose to be in, because that's 
really the point of using multiple AZs.

R


On 10/23/19, 4:06 AM, "Sergio Bilello"  wrote:

 Message from External Sender

Hello guys!

I was reading about 
https://cassandra.apache.org/doc/latest/architecture/dynamo.html#networktopologystrategy

I would like to understand a concept related to the node load balancing.

I know that Jon recommends vnodes = 4, but right now I found a cluster with vnodes = 256, replication factor = 3 and 2 racks. This is unbalanced because the number of racks is not a multiple of the replication factor.

However, my plan is to move all the nodes into a single rack so that I can eventually scale the cluster up and down one node at a time.

If I had 3 racks and wanted to keep things balanced, I should scale up 3 nodes at a time, one for each rack.

If I had 3 racks, should I also have 3 different datacenters, one datacenter for each rack?

Can I have 2 datacenters and 3 racks? If this is possible, one datacenter would have more nodes than the others? Could that be a problem?

I am thinking of splitting my cluster into one datacenter for reads and one for writes and keeping all the nodes in the same rack so I can scale up one node at a time.



Please correct me if I am wrong



Thanks,



Sergio










Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput

2019-10-22 Thread Reid Pinchback
Thanks for the reading, Jon.

From: Jon Haddad 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, October 22, 2019 at 12:32 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput

Message from External Sender
CPU waiting on memory will look like CPU overhead.   There's a good post on the 
topic by Brendan Gregg: 
http://www.brendangregg.com/blog/2017-05-09/cpu-utilization-is-wrong.html

Regarding GC, I agree with Reid.  You're probably not going to saturate your 
network card no matter what your settings, Cassandra has way too much overhead 
to do that.  It's one of the reasons why the whole zero-copy streaming feature 
was added to Cassandra 4.0: 
http://cassandra.apache.org/blog/2018/08/07/faster_streaming_in_cassandra.html

Reid is also correct in pointing out the method by which you're monitoring your 
metrics might be problematic.  With prometheus, the same data can show 
significantly different graphs when using rate vs irate, and only collecting 
once a minute would hide a lot of useful data.

If you keep digging and find you're not using all your CPU during GC pauses, 
you can try using more GC threads by setting -XX:ParallelGCThreads to match the 
number of cores you have, since by default it won't use them all.  You've got 
40 cores in the m4.10xlarge, try setting -XX:ParallelGCThreads to 40.
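On 2.1 that's a one-line addition to cassandra-env.sh; 40 is just the m4.10xlarge core count from above, adjust to your hardware:

  JVM_OPTS="$JVM_OPTS -XX:ParallelGCThreads=40"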
Jon



On Tue, Oct 22, 2019 at 11:38 AM Reid Pinchback 
mailto:rpinchb...@tripadvisor.com>> wrote:
Thomas, what is your frequency of metric collection?  If it is minute-level 
granularity, that can give a very false impression.  I’ve seen CPU and disk 
throttles that don’t even begin to show visibility until second-level 
granularity around the time of the constraining event.  Even clearer is 100ms.

Also, are you monitoring your GC activity at all?  GC bound up in a lot of 
memory copies is not going to manifest that much CPU, it’s memory bus bandwidth 
you are fighting against then.  It is easy to have a box that looks unused but 
in reality its struggling.  Given that you’ve opened up the floodgates on 
compaction, that would seem quite plausible to be what you are experiencing.
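For that kind of second-level visibility, even just leaving a few samplers running in a shell during the event window can be revealing (standard sysstat/procps tools, nothing Cassandra-specific):

  iostat -x 1    # per-device utilization, queue size and await, every second
  vmstat 1       # run queue, memory, swap and I/O wait, every second
  sar -n DEV 1   # per-NIC throughput, every second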

From: "Steinmaurer, Thomas" 
mailto:thomas.steinmau...@dynatrace.com>>
Reply-To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Date: Tuesday, October 22, 2019 at 11:22 AM
To: "user@cassandra.apache.org<mailto:user@cassandra.apache.org>" 
mailto:user@cassandra.apache.org>>
Subject: RE: Cassandra 2.1.18 - Question on stream/bootstrap throughput

Message from External Sender
Hi Alex,

Increased streaming throughput has been set on the existing nodes only, cause 
it is meant to limit outgoing traffic only, right? At least when judging from 
the name, reading the documentation etc.

Increased compaction throughput on all nodes, although my understanding is that 
it would be necessary only on the joining node to catchup with compacting 
received SSTables.

We really see no resource (CPU, NW and disk) being somehow maxed out on any 
node, which would explain the limit in the area of the new node receiving data 
at ~ 180-200 Mbit/s.

Thanks again,
Thomas

From: Oleksandr Shulgin 
Sent: Tuesday, October 22, 2019 16:35
To: User 
Subject: Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput

On Tue, Oct 22, 2019 at 12:47 PM Steinmaurer, Thomas wrote:

using 2.1.8, 3 nodes (m4.10xlarge, ESB SSD-based), vnodes=256, RF=3, we are 
trying to add a 4th node.

The two options to my knowledge, mainly affecting throughput, namely stream 
output and compaction throttling has been set to very high values (e.g. stream 
output = 800 Mbit/s resp. compaction throughput = 500 Mbyte/s) or even set to 0 
(unthrottled) in cassandra.yaml + process restart. In both scenarios 
(throttling with high values vs. unthrottled), the 4th node is streaming from 
one node capped ~ 180-200Mbit/s, according to our SFM.

The nodes have plenty of resources available (10Gbit, disk io/iops), also 
confirmed by e.g. iperf in regard to NW throughput and write to / read from 
disk in the area of 200 MByte/s

Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput

2019-10-22 Thread Reid Pinchback
Thomas, what is your frequency of metric collection?  If it is minute-level 
granularity, that can give a very false impression.  I’ve seen CPU and disk 
throttles that don’t even begin to show visibility until second-level 
granularity around the time of the constraining event.  Even clearer is 100ms.

Also, are you monitoring your GC activity at all?  GC bound up in a lot of 
memory copies is not going to manifest that much CPU, it’s memory bus bandwidth 
you are fighting against then.  It is easy to have a box that looks unused but 
in reality its struggling.  Given that you’ve opened up the floodgates on 
compaction, that would seem quite plausible to be what you are experiencing.

From: "Steinmaurer, Thomas" 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, October 22, 2019 at 11:22 AM
To: "user@cassandra.apache.org" 
Subject: RE: Cassandra 2.1.18 - Question on stream/bootstrap throughput

Message from External Sender
Hi Alex,

Increased streaming throughput has been set on the existing nodes only, cause 
it is meant to limit outgoing traffic only, right? At least when judging from 
the name, reading the documentation etc.

Increased compaction throughput on all nodes, although my understanding is that 
it would be necessary only on the joining node to catchup with compacting 
received SSTables.

We really see no resource (CPU, NW and disk) being somehow maxed out on any 
node, which would explain the limit in the area of the new node receiving data 
at ~ 180-200 Mbit/s.

Thanks again,
Thomas

From: Oleksandr Shulgin 
Sent: Tuesday, October 22, 2019 16:35
To: User 
Subject: Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput

On Tue, Oct 22, 2019 at 12:47 PM Steinmaurer, Thomas wrote:

using 2.1.8, 3 nodes (m4.10xlarge, ESB SSD-based), vnodes=256, RF=3, we are 
trying to add a 4th node.

The two options to my knowledge, mainly affecting throughput, namely stream 
output and compaction throttling has been set to very high values (e.g. stream 
output = 800 Mbit/s resp. compaction throughput = 500 Mbyte/s) or even set to 0 
(unthrottled) in cassandra.yaml + process restart. In both scenarios 
(throttling with high values vs. unthrottled), the 4th node is streaming from 
one node capped ~ 180-200Mbit/s, according to our SFM.

The nodes have plenty of resources available (10Gbit, disk io/iops), also 
confirmed by e.g. iperf in regard to NW throughput and write to / read from 
disk in the area of 200 MByte/s.

Are there any other known throughput / bootstrap limitations, which basically 
outrule above settings?

Hi Thomas,

Assuming you have 3 Availability Zones and you are adding the new node to one 
of the zones where you already have a node running, it is expected that it only 
streams from that node (its local rack).

Have you increased the streaming throughput on the node it streams from or only 
on the new node?  The limit applies to the source node as well.  You can change 
it online w/o the need to restart using nodetool command.
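For example (the numbers here are just the ones from your mail, not recommendations):

  nodetool getstreamthroughput          # current outbound streaming cap, Mb/s
  nodetool setstreamthroughput 800      # 0 means unthrottled
  nodetool setcompactionthroughput 500  # MB/s; 0 means unthrottled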

Have you checked if the new node is not CPU-bound?  It's unlikely though due to 
big instance type and only one node to stream from, more relevant for scenarios 
when streaming from a lot of nodes.

Cheers,
--
Alex



Re: Cassandra 2.1.18 - Question on stream/bootstrap throughput

2019-10-22 Thread Reid Pinchback
A high level of compaction seems highly likely to throttle you by sending the 
service into a GC death spiral, doubly-so if any repairs happen to be underway 
at the same time (I may or may not have killed a few nodes this way, but I 
admit nothing!).  Even if not in GC hell, it can cause you to episodically 
blast out writes that rapidly dirty a lot of pages, thus triggering a fill of 
the disk io queue that then starves out read requests from the disk.  More != 
Better when it comes to compaction.  You want as little compaction as your 
usage pattern requires of you.  Smoothness of its contribution to the overall 
load is a better objective.

Jon Haddad did a datastax conference talk this year on some easy tunings that 
you’ll likely want to listen to. You’ll probably end up rethinking your vnode 
count as well. Also note that a fast disk can spend a lot of its time doing the 
wrong things. His talk covers some of the factors in that.

https://www.youtube.com/watch?v=swL7bCnolkU


From: "Steinmaurer, Thomas" 
Reply-To: "user@cassandra.apache.org" 
Date: Tuesday, October 22, 2019 at 6:47 AM
To: "user@cassandra.apache.org" 
Subject: Cassandra 2.1.18 - Question on stream/bootstrap throughput

Message from External Sender
Hello,

using 2.1.8, 3 nodes (m4.10xlarge, ESB SSD-based), vnodes=256, RF=3, we are 
trying to add a 4th node.

The two options to my knowledge, mainly affecting throughput, namely stream 
output and compaction throttling has been set to very high values (e.g. stream 
output = 800 Mbit/s resp. compaction throughput = 500 Mbyte/s) or even set to 0 
(unthrottled) in cassandra.yaml + process restart. In both scenarios 
(throttling with high values vs. unthrottled), the 4th node is streaming from 
one node capped ~ 180-200Mbit/s, according to our SFM.

The nodes have plenty of resources available (10Gbit, disk io/iops), also 
confirmed by e.g. iperf in regard to NW throughput and write to / read from 
disk in the area of 200 MByte/s.

Are there any other known throughput / bootstrap limitations, which basically 
outrule above settings?

Thanks,
Thomas




Re: Cassandra Recommended System Settings

2019-10-21 Thread Reid Pinchback
Sergio, if you do some online searching about ‘bufferbloat’ in networking, 
you’ll find the background to help explain what motivates networking changes.  
Actual investigation of network performance can get a bit gnarly.  The TL;DR 
summary is that big buffers function like big queues, and thus attempts to 
speed up throughput can cause things stuck in a queue to have higher latency.  
With very fast networks, there isn’t as much need to have big buffers.  Imagine 
having a coordinator node waiting to respond to a query but it can’t because a bunch of Merkle trees are sitting in the TCP buffer waiting to be sent out.
Sometimes total latency doesn’t fairly measure actual effort to do the work, 
some of that can be time spent sitting waiting in the buffer to be shipped out 
back to the client.

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, October 21, 2019 at 4:54 PM
To: "user@cassandra.apache.org" 
Subject: Re: Cassandra Recommended System Settings

Message from External Sender
Thanks Elliott!

How do you know if there is too much RAM used for those settings?

Which metrics do you keep track of?

What would you recommend instead?

Best,

Sergio

On Mon, Oct 21, 2019, 1:41 PM Elliott Sims wrote:
Based on my experiences, if you have a new enough kernel I'd strongly suggest 
switching the TCP scheduler algorithm to BBR.  I've found the rest tend to be 
extremely sensitive to even small amounts of packet loss among cluster members 
where BBR holds up well.
High ulimits for basically everything are probably a good idea, although 
"unlimited" may not be purely optimal for all cases.
The TCP keepalive settings are probably only necessary for traffic traversing buggy/misconfigured firewalls, but shouldn't really do any harm on a modern fast network.
The TCP memory settings are pretty aggressive and probably result in 
unnecessary RAM usage.
The net.core.rmem_default/net.core.wmem_default settings are overridden by the 
TCP-specific settings as far as I know, so they're not really relevant/helpful 
for Cassandra
The net.ipv4.tcp_rmem/net.ipv4.tcp_wmem max settings are pretty aggressive.  
That works out to something like 1Gbps with 130ms latency per TCP connection, 
but on a local LAN with latencies <1ms it's enough buffer for over 100Gbps per 
TCP session.  A much smaller value will probably make more sense for most 
setups.
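Switching to BBR is just a couple of sysctls, assuming a 4.9+ kernel with the tcp_bbr module available (check the first one before flipping anything in production):

  sysctl net.ipv4.tcp_available_congestion_control   # is bbr listed?
  sysctl -w net.core.default_qdisc=fq
  sysctl -w net.ipv4.tcp_congestion_control=bbr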


On Mon, Oct 21, 2019 at 10:21 AM Sergio wrote:

Hello!

This is the kernel that I am using
Linux  4.16.13-1.el7.elrepo.x86_64 #1 SMP Wed May 30 14:31:51 EDT 2018 x86_64 
x86_64 x86_64 GNU/Linux

Best,

Sergio

On Mon, Oct 21, 2019 at 07:30 Reid Pinchback  wrote:
I don't know which distro and version you are using, but watch out for 
surprises in what vm.swappiness=0 means.  In older kernels it means "only use 
swap when desperate".  I believe that newer kernels changed to have 1 mean 
that, and 0 means to always use the oomkiller.  Neither situation is strictly 
good or bad, what matters is what you intend the system behavior to be in 
comparison with whatever monitoring/alerting you have put in place.

R


On 10/18/19, 9:04 PM, "Sergio Bilello"  wrote:

 Message from External Sender

Hello everyone!



Do you have any setting that you would change or tweak from the below list?



sudo cat /proc/4379/limits

Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            unlimited    unlimited    bytes
Max core file size        unlimited    unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             32768        32768        processes
Max open files            1048576      1048576      files
Max locked memory         unlimited    unlimited    bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       unlimited    unlimited    signals
Max msgqueue size         unlimited    unlimited    bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us



These are the sysctl settings

default['cassandra']['sysctl'] = {

'net.ipv4.tcp_keepalive_time' => 60,

'ne

Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Reid Pinchback
Think of GB to OS as something intended to support file caching.  As such the 
amount is whatever suits your usage.  If your use is almost exclusively 
reading, then file cache memory doesn’t matter that much if you’re operating 
with your storage as those nvme ssd drives that the i3’s come with.  There is 
already a chunk cache that you should be tuning in C* instead, and feeding fast 
from the O/S file cache, assuming compressed SSTables, maybe turns out to be 
less of a concern.
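The knob for that chunk cache is file_cache_size_in_mb in cassandra.yaml; a sketch only, the number below is a placeholder to experiment with rather than a recommendation:

  # cassandra.yaml
  file_cache_size_in_mb: 512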

If you have moderate write activity then your situation changes because then 
that same file cache is how your dirty background pages turn into eventual 
flushes to disk, and so you have to watch the impact of read stalls when the 
I/O fills with write requests.  You might not see this so obviously on nvme 
drives, but that could depend a lot on the distro and kernels and how the 
filesystem is mounted.

My super strong advice on issues like this is to not cargo-cult other people’s 
tunings.  Look at them for ideas, sure. But learn how to do your own 
investigations, and budget the time for it into your project.  Budget a LOT of 
time for it if your measure of “good performance” is based on latency; when 
“good” is defined in terms of throughput your life is easier.  Also, everything 
is always a little different in virtualization, and lord knows you can have 
screwball things appear in AWS. The good news is you don’t need a perfect 
configuration out of the gate; you need a configuration you understand and can 
refine; understanding comes from knowing how to do your own performance 
monitoring.


From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, October 21, 2019 at 1:16 PM
To: "user@cassandra.apache.org" 
Subject: Re: [EXTERNAL] Re: GC Tuning 
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Message from External Sender
Thanks, guys!
I just copied and pasted what I found on our test machines, but I can confirm that we have the same settings in production, except with an 8GB heap.
I didn't select these settings and I need to verify why they are there.
If any of you want to share your flags for a read-heavy workload it would be appreciated; I would then swap those flags in and test them with tlp-stress.
I am thinking about different approaches (G1GC vs ParNew + CMS)
How many GB for RAM do you dedicate to the OS in percentage or in an exact 
number?
Can you share the flags for ParNew + CMS that I can play with it and perform a 
test?

Best,
Sergio

On Mon, Oct 21, 2019 at 09:27 Reid Pinchback  wrote:
Since the instance size is < 32gb, hopefully swap isn’t being used, so it 
should be moot.

Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably doesn’t do 
anything for you.  I believe that only applies to CMS, not G1GC.  I also 
wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good thing on AWS (or 
anything virtualized), you’d have to run your own tests and find out.

R
From: Jon Haddad 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, October 21, 2019 at 12:06 PM
To: "user@cassandra.apache.org" 
Subject: Re: [EXTERNAL] Re: GC Tuning 
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Message from External Sender
One thing to note, if you're going to use a big heap, cap it at 31GB, not 32.  
Once you go to 32GB, you don't get to use compressed pointers [1], so you get 
less addressable space than at 31GB.

[1] 
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/

On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R wrote:
I don’t disagree with Jon, who has all kinds of performance tuning experience. 
But for ease of operation, we only use G1GC (on Java 8), because the tuning of 
ParNew+CMS requires a high degree of knowledge and very repeatable testing 
harnesses. It isn’t worth our time. As a previous writer mentioned, there is 
usually better return on our time tuning the schema (aka helping developers 
understand Cassandra’s strengths).

We use 16 – 32 GB heaps, nothing smaller than that.


Re: [EXTERNAL] Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Reid Pinchback
Since the instance size is < 32gb, hopefully swap isn’t being used, so it 
should be moot.

Sergio, also be aware that  -XX:+CMSClassUnloadingEnabled probably doesn’t do 
anything for you.  I believe that only applies to CMS, not G1GC.  I also 
wouldn’t take it as gospel truth that  -XX:+UseNUMA is a good thing on AWS (or 
anything virtualized), you’d have to run your own tests and find out.

R

From: Jon Haddad 
Reply-To: "user@cassandra.apache.org" 
Date: Monday, October 21, 2019 at 12:06 PM
To: "user@cassandra.apache.org" 
Subject: Re: [EXTERNAL] Re: GC Tuning 
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Message from External Sender
One thing to note, if you're going to use a big heap, cap it at 31GB, not 32.  
Once you go to 32GB, you don't get to use compressed pointers [1], so you get 
less addressable space than at 31GB.

[1] 
https://blog.codecentric.de/en/2014/02/35gb-heap-less-32gb-java-jvm-memory-oddities/

On Mon, Oct 21, 2019 at 11:39 AM Durity, Sean R wrote:
I don’t disagree with Jon, who has all kinds of performance tuning experience. 
But for ease of operation, we only use G1GC (on Java 8), because the tuning of 
ParNew+CMS requires a high degree of knowledge and very repeatable testing 
harnesses. It isn’t worth our time. As a previous writer mentioned, there is 
usually better return on our time tuning the schema (aka helping developers 
understand Cassandra’s strengths).

We use 16 – 32 GB heaps, nothing smaller than that.

Sean Durity

From: Jon Haddad 
Sent: Monday, October 21, 2019 10:43 AM
To: user@cassandra.apache.org
Subject: [EXTERNAL] Re: GC Tuning 
https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

I still use ParNew + CMS over G1GC with Java 8.  I haven't done a comparison 
with JDK 11 yet, so I'm not sure if it's any better.  I've heard it is, but I 
like to verify first.  The pause times with ParNew + CMS are generally lower 
than G1 when tuned right, but as Chris said it can be tricky.  If you aren't 
willing to spend the time understanding how it works and why each setting 
matters, G1 is a better option.

I wouldn't run Cassandra in production on less than 8GB of heap - I consider it 
the absolute minimum.  For G1 I'd use 16GB, and never 4GB with Cassandra unless 
you're rarely querying it.

I typically use the following as a starting point now:

ParNew + CMS
16GB heap
10GB new gen
2GB memtable cap, otherwise you'll spend a bunch of time copying around 
memtables (cassandra.yaml)
Max tenuring threshold: 2
survivor ratio 6

I've also done some tests with a 30GB heap, 24 GB of which was new gen.  This 
worked surprisingly well in my tests since it essentially keeps everything out 
of the old gen.  New gen allocations are just a pointer bump and are pretty 
fast, so in my (limited) tests of this I was seeing really good p99 times.  I 
was seeing a 200-400 ms pause roughly once a minute running a workload that 
deliberately wasn't hitting a resource limit (testing real world looking stress 
vs overwhelming the cluster).
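In cassandra-env.sh terms, that starting point maps roughly to the following (a sketch for Java 8; the memtable cap lives in cassandra.yaml rather than in the JVM flags):

  JVM_OPTS="$JVM_OPTS -Xms16G -Xmx16G -Xmn10G"
  JVM_OPTS="$JVM_OPTS -XX:+UseParNewGC -XX:+UseConcMarkSweepGC"
  JVM_OPTS="$JVM_OPTS -XX:SurvivorRatio=6 -XX:MaxTenuringThreshold=2"
  # and in cassandra.yaml:  memtable_heap_space_in_mb: 2048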

We built tlp-cluster [1] and tlp-stress [2] to help figure these things out.

[1] https://thelastpickle.com/tlp-cluster/
[2] http://thelastpickle.com/tlp-stress

Jon




On Mon, Oct 21, 2019 at 10:24 AM Reid Pinchback wrote:
An i3.xlarge has 30.5 GB of RAM but you’re using less than 4 GB for C*.  So minus room for other uses of JVM memory and for kernel activity, that’s about 25 GB for file cache.  You’ll have to see if you either want a bigger heap to allow for less frequent GC cycles, or you could save money on the instance size.  C* generates a lot of medium-length lifetime objects which can easily end up in old gen.  A larger heap will reduce the burn of more old-gen collections.  There are no magic numbers to just give because it’ll depend on your usage patterns.

Re: Cassandra Recommended System Settings

2019-10-21 Thread Reid Pinchback
I don't know which distro and version you are using, but watch out for 
surprises in what vm.swappiness=0 means.  In older kernels it means "only use 
swap when desperate".  I believe that newer kernels changed to have 1 mean 
that, and 0 means to always use the oomkiller.  Neither situation is strictly 
good or bad, what matters is what you intend the system behavior to be in 
comparison with whatever monitoring/alerting you have put in place.
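Worth checking what a given box is actually running with before assuming anything:

  cat /proc/sys/vm/swappiness
  sudo sysctl -w vm.swappiness=1   # "swap only under pressure" on newer kernels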

R


On 10/18/19, 9:04 PM, "Sergio Bilello"  wrote:

 Message from External Sender

Hello everyone!



Do you have any setting that you would change or tweak from the below list?



sudo cat /proc/4379/limits

Limit                     Soft Limit   Hard Limit   Units
Max cpu time              unlimited    unlimited    seconds
Max file size             unlimited    unlimited    bytes
Max data size             unlimited    unlimited    bytes
Max stack size            unlimited    unlimited    bytes
Max core file size        unlimited    unlimited    bytes
Max resident set          unlimited    unlimited    bytes
Max processes             32768        32768        processes
Max open files            1048576      1048576      files
Max locked memory         unlimited    unlimited    bytes
Max address space         unlimited    unlimited    bytes
Max file locks            unlimited    unlimited    locks
Max pending signals       unlimited    unlimited    signals
Max msgqueue size         unlimited    unlimited    bytes
Max nice priority         0            0
Max realtime priority     0            0
Max realtime timeout      unlimited    unlimited    us



These are the sysctl settings

default['cassandra']['sysctl'] = {

'net.ipv4.tcp_keepalive_time' => 60, 

'net.ipv4.tcp_keepalive_probes' => 3, 

'net.ipv4.tcp_keepalive_intvl' => 10,

'net.core.rmem_max' => 16777216,

'net.core.wmem_max' => 16777216,

'net.core.rmem_default' => 16777216,

'net.core.wmem_default' => 16777216,

'net.core.optmem_max' => 40960,

'net.ipv4.tcp_rmem' => '4096 87380 16777216',

'net.ipv4.tcp_wmem' => '4096 65536 16777216',

'net.ipv4.ip_local_port_range' => '1 65535',

'net.ipv4.tcp_window_scaling' => 1,

'net.core.netdev_max_backlog' => 2500,

'net.core.somaxconn' => 65000,

'vm.max_map_count' => 1048575,

'vm.swappiness' => 0

}



Am I missing something else?



Do you have any experience to configure CENTOS 7

for 

JAVA HUGE PAGES


https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html#CheckJavaHugepagessettings
 



OPTIMIZE SSD


https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html#OptimizeSSDs
 




https://docs.datastax.com/en/dse/5.1/dse-admin/datastax_enterprise/config/configRecommendedSettings.html
 



We are using AWS i3.xlarge instances



Thanks,



Sergio










Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

2019-10-21 Thread Reid Pinchback
An i3.xlarge has 30.5 GB of RAM but you’re using less than 4 GB for C*.  So 
minus room for other uses of jvm memory and for kernel activity, that’s about 
25 gb for file cache.  You’ll have to see if you either want a bigger heap to 
allow for less frequent gc cycles, or you could save money on the instance 
size.  C* generates a lot of medium-length lifetime objects which can easily 
end up in old gen.  A larger heap will reduce the burn of more old-gen 
collections.  There are no magic numbers to just give because it’ll depend on 
your usage patterns.
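If you do try a bigger heap, the usual place to override the auto-sizing is cassandra-env.sh (the numbers below are placeholders to experiment with, not recommendations; HEAP_NEWSIZE only matters for ParNew/CMS):

  MAX_HEAP_SIZE="8G"
  HEAP_NEWSIZE="2G"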

From: Sergio 
Reply-To: "user@cassandra.apache.org" 
Date: Sunday, October 20, 2019 at 2:51 PM
To: "user@cassandra.apache.org" 
Subject: Re: GC Tuning https://thelastpickle.com/blog/2018/04/11/gc-tuning.html

Message from External Sender
Thanks for the answer.

This is the JVM version that I have right now.

openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

These are the current flags. Would you change anything in a i3x.large aws node?

java -Xloggc:/var/log/cassandra/gc.log 
-Dcassandra.max_queued_native_transport_requests=4096 -ea 
-XX:+UseThreadPriorities -XX:ThreadPriorityPolicy=42 
-XX:+HeapDumpOnOutOfMemoryError -Xss256k -XX:StringTableSize=103 
-XX:+AlwaysPreTouch -XX:-UseBiasedLocking -XX:+UseTLAB -XX:+ResizeTLAB 
-XX:+UseNUMA -XX:+PerfDisableSharedMem -Djava.net.preferIPv4Stack=true 
-XX:SurvivorRatio=8 -XX:MaxTenuringThreshold=1 -XX:+UseG1GC 
-XX:G1RSetUpdatingPauseTimePercent=5 -XX:MaxGCPauseMillis=200 
-XX:InitiatingHeapOccupancyPercent=45 -XX:G1HeapRegionSize=0 
-XX:-ParallelRefProcEnabled -Xms3821M -Xmx3821M 
-XX:CompileCommandFile=/etc/cassandra/conf/hotspot_compiler 
-Dcom.sun.management.jmxremote.port=7199 
-Dcom.sun.management.jmxremote.rmi.port=7199 
-Dcom.sun.management.jmxremote.ssl=false 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.password.file=/etc/cassandra/conf/jmxremote.password
 
-Dcom.sun.management.jmxremote.access.file=/etc/cassandra/conf/jmxremote.access 
-Djava.library.path=/usr/share/cassandra/lib/sigar-bin 
-Djava.rmi.server.hostname=172.24.150.141 -XX:+CMSClassUnloadingEnabled 
-javaagent:/usr/share/cassandra/lib/jmx_prometheus_javaagent-0.3.1.jar=10100:/etc/cassandra/default.conf/jmx-export.yml
 -Dlogback.configurationFile=logback.xml -Dcassandra.logdir=/var/log/cassandra 
-Dcassandra.storagedir= -Dcassandra-pidfile=/var/run/cassandra/cassandra.pid 
-Dcassandra-foreground=yes -cp 

Re: Elevated response times from all nodes in a data center at the same time.

2019-10-16 Thread Reid Pinchback
es level you can look at:
- logs (grep tombstone)
- sstablemetadata gives you the % of droppable tombstones. This is an estimate of the space that could be freed; it gives no information on whether tombstones are being read and can affect performance or not, yet it gives you an idea of the tombstones that can be generated in the workflow
- Trace queries: either trace a manual query from cqlsh with 'TRACING ON;' and then send queries similar to prod ones, or directly use 'nodetool settraceprobability X'. /!\ Ensure X is really low to start with - like 0.001 or 0.0001 maybe; we probably don't need many queries to understand what happened, and tracing might inflict a big penalty on the Cassandra servers in terms of performance (each traced query induces a bunch of queries to actually persist the trace in the system_traces keyspace).

We do not see any Read request timeouts or Exception in the our API Splunk logs 
only p99 and average latency go up suddenly.

What's the value you use for timeouts? Also, any other exception/timeout, 
somewhere else than for reads?
What are the result of:

- nodetool tablestats (I think this would gather what you need to check --> 
nodetool tablestats | grep -e Keyspace -e Table: -e latency -e partition -e 
tombstones)
- watch -d nodetool tpstats (here look at any pending threads constantly higher 
than 0, any blocked or dropped threads)

We have investigated CPU level usage, Disk I/O, Memory usage and Network 
parameters for the nodes during this period and we are not experiencing any 
sudden surge in these parameters.

If the resources are fine, there is a bottleneck within Cassandra that we need to find; the commands above aim at that - finding C*'s bottleneck, assuming the machines can handle more load.

We setup client using WhiteListPolicy to send queries to each of the 6 nodes to 
understand which one is bad, but we see all of them responding with very high 
latency. It doesn't happen during our peak traffic period sometime in the night.

This brings something else to my mind. The fact that latency goes down when there is a traffic increase can simply mean that each of the queries sent during the spike is really efficient; even though you might still have some other queries being slow (even during peak hours), having that many 'efficient/quick' requests lowers the average/pXX latencies. Does the max latency change over time?

You can here try to get a sense of this with:

- nodetool proxyhistograms
- nodetool tablehistograms <keyspace> <table>   # for the table-level stats

We checked the system.log files on our nodes, took a thread dump and checked 
for any rogue processes running on the nodes stealing CPU, but we found nothing.

From what I read/understand, resources are fine, so I would put these searches aside for now. About the log file, I like to use:

- tail -fn 100 /var/log/cassandra/system.log #See current logs (if you are 
having the issues NOW)
- grep -e "WARN" -e "ERROR" /var/log/cassandra/system.log # to check what 
happened and was wrong

For now I can't think about anything else, I hope some of those ideas will help 
you diagnose the problem. Once it is diagnosed, we should be able to reason 
about how we can fix it.

C*heers,
---
Alain Rodriguez - al...@thelastpickle.com
France / Spain

The Last Pickle - Apache Cassandra Consulting
http://www.thelastpickle.com

On Tue, Oct 15, 2019 at 17:26, Reid Pinchback wrote:
I’d look to see if you have compactions fronting the p99’s.  If so, then go 
back to looking at the I/O.  Disbelieve any metrics not captured at a high 
resolution for a time window around the compactions, like 100ms.  You could be 
hitting I/O stalls where reads are blocked by the flushing of writes.  It’s 
short-lived when it happens, and per-minute metrics won’t provide breadcrumbs.

From: Bill Walters 
Date: Monday, October 14, 2019 at 7:10 PM
To: user@cassandra.apache.org
Subject: Elevated response times from all nodes in a data center at the same 
time.

Hi Everyone,

Need some suggestions regarding a peculiar issue we started facing in our 
production cluster for the last couple of days.

Here are our Production environment details.

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each 
region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 
2 OLAP Data centers for Analytics and Batch loads)
Instance Types: r5.8x Large
Average Node Size: 182 GB
Work Load: Read heavy
Read TPS: 22k
Cassandra version: 3.0.15
Java Version: JDK 181.

Re: Elevated response times from all nodes in a data center at the same time.

2019-10-15 Thread Reid Pinchback
I’d look to see if you have compactions fronting the p99’s.  If so, then go 
back to looking at the I/O.  Disbelieve any metrics not captured at a high 
resolution for a time window around the compactions, like 100ms.  You could be 
hitting I/O stalls where reads are blocked by the flushing of writes.  It’s 
short-lived when it happens, and per-minute metrics won’t provide breadcrumbs.

From: Bill Walters 
Date: Monday, October 14, 2019 at 7:10 PM
To: 
Subject: Elevated response times from all nodes in a data center at the same 
time.

Hi Everyone,

Need some suggestions regarding a peculiar issue we started facing in our 
production cluster for the last couple of days.

Here are our Production environment details.

AWS Regions: us-east-1 and us-west-2. Deployed over 3 availability zone in each 
region.
No of Nodes: 24
Data Centers: 4 (6 nodes in each data center, 2 OLTP Data centers for APIs and 
2 OLAP Data centers for Analytics and Batch loads)
Instance Types: r5.8x Large
Average Node Size: 182 GB
Work Load: Read heavy
Read TPS: 22k
Cassandra version: 3.0.15
Java Version: JDK 181.
EBS Volumes: GP2 with 1TB 3000 iops.

1. We have been running in production for more than one year and our experience 
with Cassandra is great. Experienced little hiccups here and there but nothing 
severe.

2. But recently for the past couple of days we see a behavior where our p99 
latency in our AWS us-east-1 region OLTP data center, suddenly starts rising 
from 2 ms to 200 ms. It starts with one node where we see the 99th percentile 
Read Request latency in Datastax Opscenter starts increasing. And it spreads 
immediately, to all other 6 nodes in the data center.

3. We do not see any Read request timeouts or Exception in the our API Splunk 
logs only p99 and average latency go up suddenly.

4. We have investigated CPU level usage, Disk I/O, Memory usage and Network 
parameters for the nodes during this period and we are not experiencing any 
sudden surge in these parameters.

5. We setup client using WhiteListPolicy to send queries to each of the 6 nodes 
to understand which one is bad, but we see all of them responding with very 
high latency. It doesn't happen during our peak traffic period sometime in the 
night.

6. We checked the system.log files on our nodes, took a thread dump and checked 
for any rogue processes running on the nodes stealing CPU, but we found nothing.

7. We even checked our the write requests coming in during this time and we do 
not see any large batch operations happening.

8. Initially we tried restarting the nodes to see if the issue can be mitigated 
but it kept happening, and we had to fail over API traffic to us-west-2 region 
OLTP data center. After a couple of hours we failed back and everything seems 
to be working.

We are baffled by this behavior, only correlation we find is the "Native 
requests pending" in our Task queues when this happens.

Please let us know your suggestions on how to debug this issue. Has anyone 
experienced an issue like this before.(We had issues where one node starts 
acting bad due to bad EBS volume I/O read and write time, but all nodes 
experiencing an issue at same time is very peculiar)

Thank You,
Bill Walters.