I'll try to answer the questions from the last two messages.

Network traffic looks pretty steady overall, about 0.5 to 2 MB/s. The cluster handles about 100k to 500k operations per minute. The read/write ratio is about 50/50 right now, but it will probably eventually shift to 70% writes and 30% reads.

There do seem to be some nodes that are affected more frequently than others. I haven't captured CPU/memory stats for those nodes versus the others at the time the problem is occurring; I will do that next time it happens. I will also look at compaction stats and tpstats. What are some things I should be looking for in tpstats in particular? I'm not exactly sure how to read the output from that command.
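For what it's worth, this is roughly the capture script I'm planning to run when it happens again; the output directory and the exact command list are placeholders for our setup:

#!/usr/bin/env python3
# Rough sketch of what I plan to capture the next time latency spikes.
# The output directory and the command list are placeholders.
import datetime
import pathlib
import subprocess

OUT_DIR = pathlib.Path("/tmp/cassandra-debug")  # placeholder location
COMMANDS = {
    "tpstats": ["nodetool", "tpstats"],
    "compactionstats": ["nodetool", "compactionstats"],
    "top": ["top", "-b", "-n", "1"],  # cpu snapshot
    "free": ["free", "-m"],           # memory snapshot
}

def snapshot():
    stamp = datetime.datetime.now().strftime("%Y%m%dT%H%M%S")
    OUT_DIR.mkdir(parents=True, exist_ok=True)
    for name, cmd in COMMANDS.items():
        result = subprocess.run(cmd, capture_output=True, text=True)
        (OUT_DIR / (stamp + "-" + name + ".txt")).write_text(result.stdout)

if __name__ == "__main__":
    snapshot()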

The heap size is set to 15GB on each node, and each node has 60GB of RAM available.

Regarding the "... is now DOWN" messages: I'm unable to find one in the system.log for a time when I know a node was having problems. I've built a system that polls nodetool status and parses the output, and if it sees a node reporting as DN it sends a message to a Slack channel. Is it possible for a node to report as DN, but not have the message show up in the log?
The node running the nodetool status polling is not the node that was reported as DN.
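For context, the poller is essentially a sketch along these lines (the webhook URL and poll interval here are placeholders, and error handling is stripped out):

#!/usr/bin/env python3
# Rough sketch of the poller: run nodetool status, look for DN lines,
# and post to Slack. The webhook URL and poll interval are placeholders.
import json
import subprocess
import time
import urllib.request

SLACK_WEBHOOK = "https://hooks.slack.com/services/XXX/YYY/ZZZ"  # placeholder
POLL_SECONDS = 60

def down_nodes():
    out = subprocess.run(["nodetool", "status"], capture_output=True, text=True).stdout
    # Node lines in nodetool status start with a two-letter state such as UN or DN,
    # followed by the node's address.
    return [parts[1] for parts in (line.split() for line in out.splitlines())
            if len(parts) > 1 and parts[0] == "DN"]

def alert(addresses):
    payload = json.dumps({"text": "Cassandra nodes reporting DN: " + ", ".join(addresses)})
    req = urllib.request.Request(SLACK_WEBHOOK, data=payload.encode("utf-8"),
                                 headers={"Content-Type": "application/json"})
    urllib.request.urlopen(req)

if __name__ == "__main__":
    while True:
        down = down_nodes()
        if down:
            alert(down)
        time.sleep(POLL_SECONDS)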

I'm a bit unclear on the last point about the mtime/size of files and how to check that; can you provide more information there?

Thanks for all the help, I really appreciate it.



On Jun 1 2017, at 10:33 am, Victor Chen <victor.h.c...@gmail.com> wrote:
Hi Daniel,

In my experience, when a node shows DN and then comes back up by itself, that sounds like some sort of GC pause (especially if nodetool status, when run from the "DN" node itself, shows it is up -- assuming there isn't a spotty network issue). Perhaps I missed this info due to the length of the thread, but have you shared info about the following?
  • cpu/memory usage of affected nodes (are all nodes affected comparably, or some more than others?)
  • nodetool compactionstats and tpstats output
  • what is your heap size set to?
  • system.log and gc.logs: for investigating node "DN" symptoms I will usually start by noting the timestamps of the "123.56.78.901 is now DOWN" entries in the system.log of the other nodes, to tell me where to look in the system.log of the node in question. Then it's a question of answering "what was this node doing up to that point?" (see the sketch after this list)
  • mtime/size of files in data directory-- which files are growing in size?
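For the system.log point, here is a rough sketch for pulling out the "is now DOWN" timestamps; /var/log/cassandra/system.log is the default package location, so adjust the path for your install:

#!/usr/bin/env python3
# Rough sketch: print the "is now DOWN" entries from system.log so you know
# which timestamps to dig into on the node that was marked down.
import sys

log_path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/cassandra/system.log"

with open(log_path) as log:
    for line in log:
        if "is now DOWN" in line:
            # The full line includes the timestamp and the address of the node.
            print(line.rstrip())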

That will help reduce how much we need to speculate. I don't think you should need to restart Cassandra every X days if things are optimally configured for your read/write pattern -- at least, I would not want to use something where that is the normal expected behavior (and I don't believe Cassandra is one of those sorts of things).


On Thu, Jun 1, 2017 at 11:40 AM, daemeon reiydelle <daeme...@gmail.com> wrote:
Some random thoughts. I would like to thank you for giving us an interesting problem; Cassandra can get boring sometimes, it is too stable.

- Do you have a way to monitor the network traffic to see if it is increasing between restarts or does it seem relatively flat?
- What activities are happening when you observe the (increasing) latencies? Something must be writing to keyspaces, something I presume is reading. What is the workload?
- When using SSDs, there are some block-device optimizations for SSDs (scheduler, readahead, and so on). I wonder if those were done -- missing them will cause some IO latency, but not like this. There's a quick sketch below for eyeballing a few of them.
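A rough sketch for checking a few of the usual block-device settings; the device name is a placeholder, so substitute your data volume:

#!/usr/bin/env python3
# Rough sketch: print a few block-device settings that are commonly tuned for SSDs.
# "xvdb" is a placeholder device name; substitute your data volume.
from pathlib import Path

DEVICE = "xvdb"
SETTINGS = ["queue/scheduler", "queue/rotational", "queue/read_ahead_kb", "queue/nr_requests"]

for setting in SETTINGS:
    path = Path("/sys/block") / DEVICE / setting
    value = path.read_text().strip() if path.exists() else "not found"
    print(setting + ": " + value)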





Daemeon C.M. Reiydelle
USA (+1) 415.501.0198
London (+44) (0) 20 8144 9872




On Thu, Jun 1, 2017 at 7:18 AM, Daniel Steuernol <dan...@sendwithus.com> wrote:
I am just restarting Cassandra. I don't think I'm having any disk space issues, but we're having issues where operations have increased latency, and these are fixed by a restart. It seemed like the load reported by nodetool status might be helpful in understanding what is going wrong, but I'm not sure. Another symptom is that nodes will report as DN in nodetool status and then come back up again just a minute later.

I'm not really sure what to track to find out what exactly is going wrong on the cluster, so any insight or debugging techniques would be super helpful


On May 31 2017, at 5:07 pm, Anthony Grasso <anthony.gra...@gmail.com> wrote:
Hi Daniel,

When you say that the nodes have to be restarted, are you just restarting the Cassandra service or are you restarting the machine?
How are you reclaiming disk space at the moment? Does disk space free up after the restart?

Regarding storage on nodes, keep in mind that the more data stored on a node, the longer some operations to maintain that data will take to complete. In addition, the more data that is on each node, the longer it will take to stream data to other nodes. Whether it is replacing a down node or inserting a new node, having a large amount of data on each node will mean that it takes longer for a node to join the cluster if it is streaming the data.

Kind regards,
Anthony

On 30 May 2017 at 02:43, Daniel Steuernol <dan...@sendwithus.com> wrote:
The cluster is running with RF=3; right now each node is storing about 3-4 TB of data. I'm using r4.2xlarge EC2 instances, which have 8 vCPUs and 61 GB of RAM, and the data drives are gp2 SSD EBS volumes with 10k IOPS. I guess this brings up the question of what's a good marker for deciding whether to increase disk space vs. provision a new node?



On May 29 2017, at 9:35 am, tommaso barbugli <tbarbu...@gmail.com> wrote:
Hi Daniel,

This is not normal. Possibly a capacity problem. What's the RF, how much data do you store per node, and what kind of servers do you use (core count, RAM, disk, ...)?

Cheers,
Tommaso

On Mon, May 29, 2017 at 6:22 PM, Daniel Steuernol <dan...@sendwithus.com> wrote:

I am running a 6-node cluster, and I have noticed that the load reported on each node rises throughout the week and grows way past the actual disk space used and available on each node. Eventually, latency for operations also suffers and the nodes have to be restarted. A couple of questions on this: is this normal? Does Cassandra need to be restarted every few days for best performance? Any insight into this behaviour would be helpful.

Cheers,
Daniel