Re: Why and How is Cassandra using all my RAM?
On 19 July 2018 at 10:43, Léo FERLIN SUTTON wrote:
> Hello list!
>
> I have a question about cassandra memory usage.
>
> My cassandra nodes are slowly using up all my RAM until they get OOM-killed.
>
> When I check the memory usage with nodetool info, the memory
> (off-heap + heap) doesn't match what the java process is really using.

Hi Léo,

It's possible that glibc is creating too many memory arenas. Are you setting/exporting MALLOC_ARENA_MAX to something sane before starting the JVM? You can check that in /proc/<pid>/environ.

I would also turn on -XX:NativeMemoryTracking=summary and use jcmd to check out native memory usage from the JVM's perspective.

-Mark
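Mark's /proc/<pid>/environ check can be scripted. A minimal sketch (the helper name and the example pid are illustrative, not from the thread) for pulling MALLOC_ARENA_MAX out of the environment file, which stores NUL-separated KEY=VALUE pairs:

```python
# Read a variable such as MALLOC_ARENA_MAX from a process's environment.
# /proc/<pid>/environ is a NUL-separated list of KEY=VALUE entries.
def get_env_var(environ_bytes, name):
    for entry in environ_bytes.split(b"\x00"):
        key, sep, value = entry.partition(b"=")
        if sep and key.decode() == name:
            return value.decode()
    return None

# Against a live process (pid 12345 is hypothetical):
# with open("/proc/12345/environ", "rb") as f:
#     print(get_env_var(f.read(), "MALLOC_ARENA_MAX"))
```

If the variable comes back as None, the JVM was started without it and glibc may create one arena per core times eight, which can inflate resident memory well past heap + off-heap.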
Re: Migrating to LCS: Disk Size recommendation clashes
Hi Amit,

The size recommendations are based on balancing CPU and the amount of data stored on a node. LCS requires less disk space but generally requires much more CPU to keep up with compaction for the same amount of data, which is why the size recommendation is smaller. There is nothing wrong with attaching a larger disk, of course. The sizes are recommendations to start with when you have nothing else to go by.

If your cluster is light on writes, you may be able to store much larger amounts of data than the suggested sizes and have no problem keeping up with LCS compaction. If your cluster is heavy on writes, you may find you can only store a small fraction of the data per node that you were able to store with STCS. You will have to benchmark for your use case.

The 10 TB number is from a theoretical situation where LCS would result in reading a maximum of 7 SSTables to serve a read -- if LCS compaction can keep up.

Cheers,
Mark

On Thu, Apr 13, 2017 at 8:23 AM, Amit Singh F wrote:
> Hi All,
>
> We are in the process of migrating from STCS to LCS and were just doing a few
> reads online. Below is the excerpt from the Datastax recommendation on data
> size:
>
> Doc link: https://docs.datastax.com/en/landing_page/doc/landing_page/planning/planningHardware.html
>
> Also there is one more recommendation where it hints that disk size can
> be limited to 10 TB (worst case). Below is the excerpt:
>
> Doc link: http://www.datastax.com/dev/blog/leveled-compaction-in-apache-cassandra
>
> So are there any restrictions/scenarios due to which 600GB is the
> preferred size for LCS?
>
> Thanks & Regards
> Amit Singh
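The 7-sstable worst case comes from LCS's tiered structure: each level holds roughly 10x the previous one, and a read touches at most one sstable per non-overlapping level, plus whatever sits in L0. A rough sketch, assuming the default 160 MB sstable target and a fanout of 10 (both are assumptions about defaults, not figures from the thread):

```python
# Rough sketch of LCS level sizing. Level n holds ~10x the data of
# level n-1; a read touches at most one sstable per level (plus L0).
SSTABLE_MB = 160   # assumed default sstable_size_in_mb for LCS
FANOUT = 10        # assumed level growth factor

def levels_needed(total_mb):
    """Number of leveled tiers needed to hold total_mb of data."""
    level, capacity, cumulative = 0, SSTABLE_MB * FANOUT, 0
    while cumulative < total_mb:
        level += 1
        cumulative += capacity
        capacity *= FANOUT
    return level

# ~10 TB of data fits within 5 levels beyond L0, so a worst-case read
# touches on the order of 5-7 sstables once L0 is included -- in line
# with the 10 TB figure in the linked blog post.
print(levels_needed(10 * 1024 * 1024))  # 10 TB expressed in MB -> 5
```

Past that size, the level count (and therefore worst-case sstables per read) keeps growing, which is one reason the worst case is capped around 10 TB.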
Re: Maximum memory usage reached in Cassandra!
You may have better luck switching to G1GC and using a much larger heap (16 to 30 GB). 4 GB is likely too small for your amount of data, especially if you have a lot of sstables. Then try increasing file_cache_size_in_mb further.

Cheers,
Mark

On Tue, Mar 28, 2017 at 3:01 AM, Mokkapati, Bhargav (Nokia - IN/Chennai) wrote:
> Hi Cassandra users,
>
> I am getting "Maximum memory usage reached (536870912 bytes), cannot
> allocate chunk of 1048576 bytes". As a remedy I changed the off-heap
> memory usage limit cap, i.e. the file_cache_size_in_mb parameter in
> cassandra.yaml, from 512 to 1024.
>
> But now the increased limit has filled up again and it is throwing
> "Maximum memory usage reached (1073741824 bytes), cannot allocate chunk
> of 1048576 bytes".
>
> This issue occurs when redistribution of indexes is happening; the
> Cassandra nodes are still UP, but read requests are failing from the
> application side.
>
> My configuration details are as below:
>
> 5 node cluster, each node with 68 disks, each disk 3.7 TB
>
> Total CPU cores: 8
>
> total Mem: 377G
> used: 265G
> free: 58G
> shared: 378M
> buff/cache: 53G
> available: 104G
>
> MAX_HEAP_SIZE is 4GB
> file_cache_size_in_mb: 1024
>
> memtable heap space is commented out in the yaml file as below:
> # memtable_heap_space_in_mb: 2048
> # memtable_offheap_space_in_mb: 2048
>
> Can anyone please suggest a solution for this issue? Thanks in advance!
>
> Thanks,
> Bhargav M
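A sketch of what Mark's advice could look like in configuration. The file layout follows the usual cassandra-env.sh / cassandra.yaml split of that era; the exact values (16G heap, 500 ms pause target, 2048 MB cache) are illustrative starting points to benchmark, not tested recommendations:

```shell
# cassandra-env.sh (sketch): a larger heap with G1GC.
MAX_HEAP_SIZE="16G"
HEAP_NEWSIZE=""                        # not used by G1
JVM_OPTS="$JVM_OPTS -XX:+UseG1GC"
JVM_OPTS="$JVM_OPTS -XX:MaxGCPauseMillis=500"

# cassandra.yaml (sketch): raise the off-heap chunk cache beyond the
# 1024 MB that is already being exhausted.
# file_cache_size_in_mb: 2048
```

With 377G of RAM on the box there is plenty of headroom for a 16-30 GB heap; the rest stays available for the page cache, which Cassandra relies on heavily.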
Re: Adding disk capacity to a running node
I've had luck using the st1 EBS type, too, for situations where reads are rare (the commit log still needs to be on its own high-IOPS volume; I like using ephemeral storage for that).

On Mon, Oct 17, 2016 at 3:03 PM, Branton Davis wrote:
> I doubt that's true anymore. EBS volumes, while previously discouraged, are
> the most flexible way to go, and are very reliable. You can attach, detach,
> and snapshot them too. If you don't need provisioned IOPS, the GP2 SSDs are
> more cost-effective and allow you to balance IOPS with cost.
>
> On Mon, Oct 17, 2016 at 1:55 PM, Jonathan Haddad wrote:
>>
>> Vladimir,
>>
>> *Most* people running Cassandra are doing so using ephemeral disks.
>> Instances are not arbitrarily moved to different hosts. Yes, instances can
>> be shut down, but that's why you distribute across AZs.
>>
>> On Mon, Oct 17, 2016 at 11:48 AM, Vladimir Yudovin wrote:
>>>
>>> It's extremely unreliable to use ephemeral (local) disks. Even if you
>>> don't stop the instance yourself, it can be restarted on a different server
>>> in case of some hardware failure or AWS-initiated update. So all node data
>>> will be lost.
>>>
>>> Best regards, Vladimir Yudovin,
>>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>>> Launch your cluster in minutes.
>>>
>>> On Mon, 17 Oct 2016 14:45:00 -0400, Seth Edwards wrote:
>>>
>>> These are i2.2xlarge instances, so the disks are currently configured as
>>> dedicated ephemeral disks.
>>>
>>> On Mon, Oct 17, 2016 at 11:34 AM, Laing, Michael wrote:
>>>
>>> You could just expand the size of your ebs volume and extend the file
>>> system. No data is lost - assuming you are running Linux.
>>>
>>> On Monday, October 17, 2016, Seth Edwards wrote:
>>>
>>> We're running 2.0.16. We're migrating to a new data model, but we've had
>>> an unexpected increase in write traffic that has caused us some capacity
>>> issues when we encounter compactions. Our old data model is on STCS. We'd
>>> like to add another ebs volume (we're on aws) to our JBOD config and
>>> hopefully avoid any situation where we run out of disk space during a large
>>> compaction. It appears that the behavior we are hoping to get is actually
>>> undesirable and was removed in 3.2. It still might be an option for us until
>>> we can finish the migration.
>>>
>>> I'm not familiar with LVM, so it may be a bit risky to try at this point.
>>>
>>> On Mon, Oct 17, 2016 at 9:42 AM, Yabin Meng wrote:
>>>
>>> I assume you're talking about a Cassandra JBOD (just a bunch of disks)
>>> setup, because you mention adding it to the list of data directories. If
>>> this is the case, you may run into issues, depending on your C* version.
>>> Check this out: http://www.datastax.com/dev/blog/improving-jbod.
>>>
>>> Another approach is to use LVM to manage multiple devices under a
>>> single mount point. If you do so, all Cassandra sees is simply
>>> increased disk storage space, and there should be no problem.
>>>
>>> Hope this helps,
>>>
>>> Yabin
>>>
>>> On Mon, Oct 17, 2016 at 11:54 AM, Vladimir Yudovin wrote:
>>>
>>> Yes, Cassandra should keep the percentage of disk usage equal across all
>>> disks. The compaction process and SSTable flushes will use the new disk to
>>> distribute both new and existing data.
>>>
>>> Best regards, Vladimir Yudovin,
>>> Winguzone - Hosted Cloud Cassandra on Azure and SoftLayer.
>>> Launch your cluster in minutes.
>>>
>>> On Mon, 17 Oct 2016 11:43:27 -0400, Seth Edwards wrote:
>>>
>>> We have a few nodes that are running out of disk capacity at the moment,
>>> and instead of adding more nodes to the cluster, we would like to add
>>> another disk to the server and add it to the list of data directories. My
>>> question is: will Cassandra use the new disk for compactions on sstables
>>> that already exist in the primary directory?
>>>
>>> Thanks!
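Vladimir's point about Cassandra evening out disk usage can be illustrated with a toy allocator: if new sstables always go to the data directory with the most free space, a freshly attached, mostly empty volume naturally absorbs new flushes and compaction output until usage levels out. (This is a simplification; the real allocator, especially after the JBOD improvements linked above, is more involved.)

```python
# Toy JBOD-style data-directory selection: write the next sstable to
# the directory with the most free space. Over time this evens out
# usage across disks. Paths and sizes below are illustrative.
def pick_data_dir(dirs):
    """dirs: list of (path, free_bytes). Returns the path with most free space."""
    return max(dirs, key=lambda d: d[1])[0]

dirs = [
    ("/var/lib/cassandra/data1", 50 * 2**30),   # nearly full original disk
    ("/var/lib/cassandra/data2", 400 * 2**30),  # newly attached volume
]
print(pick_data_dir(dirs))  # the new, mostly empty volume wins
```

Note this is exactly the behavior that can bite during one huge STCS compaction: the output goes to a single directory, so one volume still needs room for the whole compacted sstable.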
Re: Is it ok to restart DECOMMISSION
I've done that several times. Kill the process, restart it, let it sync, decommission.

You'll need enough space on the receiving nodes for the full set of data, on top of the other data that was already sent earlier, plus room to cleanup/compact it.

Before you kill, check system.log to see if it died on anything. If so, the decommission process will never finish. If not, let it continue.

Of particular note is that by default, transferring large sstables will time out. You can fix that by adjusting streaming_socket_timeout_in_ms to a sufficiently large value (I set it to a day).

-Mark

On Thu, Sep 15, 2016 at 9:28 AM, laxmikanth sadula wrote:
> I started decommissioning a node in our cassandra cluster.
> But it's taking too long (more than 12 hrs), so I would like to
> restart (stop/kill the node & restart 'nodetool decommission' again).
>
> Will killing the node/stopping the decommission and restarting the
> decommission cause any issues to the cluster?
>
> Using c*-2.0.17, 2 data centers, each DC with 3 groups, each group
> with 3 nodes, with RF-3.
>
> --
> Thanks...!
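For reference, the "set it to a day" value for streaming_socket_timeout_in_ms works out as:

```python
# streaming_socket_timeout_in_ms takes milliseconds; one day is:
day_in_ms = 24 * 60 * 60 * 1000
print(day_in_ms)  # 86400000
```

So the cassandra.yaml line Mark describes would read `streaming_socket_timeout_in_ms: 86400000`.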
Re: large number of pending compactions, sstables steadily increasing
Hi Ezra,

Are you making frequent changes to your rows (including TTL'ed values), or mostly inserting new ones? If you're only inserting new data, it's probable that size-tiered compaction would work better for you. If you are TTL'ing whole rows, consider date-tiered.

If leveled compaction is still the best strategy, one way to catch up with compactions is to have less data per node -- in other words, use more machines. Leveled compaction is CPU-expensive. You are currently CPU-bottlenecked, or, from the other perspective, you have too much data per node for leveled compaction.

At this point, compaction is so far behind that you'll likely see high latency if you're reading old rows (since dozens to hundreds of uncompacted sstables will likely need to be checked for matching rows). You may be better off with size-tiered compaction, even if it means always reading several sstables per read (higher latency than when leveled can keep up).

How much data do you have per node? Do you update/insert/delete rows? Do you TTL?

Cheers,
Mark

On Wed, Aug 17, 2016 at 2:39 PM, Ezra Stuetzel wrote:
> I have one node in my cluster 2.2.7 (just upgraded from 2.2.6, hoping to fix
> the issue) which seems to be stuck in a weird state -- with a large number of
> pending compactions and sstables. The node is compacting about 500gb/day;
> the number of pending compactions is going up at about 50/day. It is at about
> 2300 pending compactions now. I have tried increasing the number of compaction
> threads and the compaction throughput, which doesn't seem to help eliminate
> the many pending compactions.
>
> I have tried running 'nodetool cleanup' and 'nodetool compact'. The latter
> has fixed the issue in the past, but most recently I was getting OOM errors,
> probably due to the large number of sstables. I upgraded to 2.2.7 and am no
> longer getting OOM errors, but it does not resolve the issue.
> I do see this message in the logs:
>
>> INFO [RMI TCP Connection(611)-10.9.2.218] 2016-08-17 01:50:01,985
>> CompactionManager.java:610 - Cannot perform a full major compaction as
>> repaired and unrepaired sstables cannot be compacted together. These two set
>> of sstables will be compacted separately.
>
> Below are the 'nodetool tablestats' comparing a normal and the problematic
> node. You can see the problematic node has many more sstables, and they are
> all in level 1. What is the best way to fix this? Can I just delete those
> sstables somehow and then run a repair?
>>
>> Normal node
>>>
>>> keyspace: mykeyspace
>>> Read Count: 0
>>> Read Latency: NaN ms.
>>> Write Count: 31905656
>>> Write Latency: 0.051713177939359714 ms.
>>> Pending Flushes: 0
>>> Table: mytable
>>> SSTable count: 1908
>>> SSTables in each level: [11/4, 20/10, 213/100, 1356/1000, 306, 0, 0, 0, 0]
>>> Space used (live): 301894591442
>>> Space used (total): 301894591442
>>>
>>> Problematic node
>>>
>>> Keyspace: mykeyspace
>>> Read Count: 0
>>> Read Latency: NaN ms.
>>> Write Count: 30520190
>>> Write Latency: 0.05171286705620116 ms.
>>> Pending Flushes: 0
>>> Table: mytable
>>> SSTable count: 14105
>>> SSTables in each level: [13039/4, 21/10, 206/100, 831, 0, 0, 0, 0, 0]
>>> Space used (live): 561143255289
>>> Space used (total): 561143255289
>
> Thanks,
> Ezra
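The "SSTables in each level" lines quoted above are worth decoding: an entry like 13039/4 means that level holds 13039 sstables against an expected maximum of 4 -- i.e., the first level is thousands of sstables behind. A small sketch (the parser is illustrative, not a Cassandra API):

```python
# Parse nodetool's "SSTables in each level" line. "actual/limit" entries
# flag levels that exceed their target; a bare number is within target.
def parse_levels(s):
    out = []
    for tok in s.strip("[]").split(","):
        tok = tok.strip()
        if "/" in tok:
            actual, limit = tok.split("/")
            out.append((int(actual), int(limit)))
        else:
            out.append((int(tok), None))
    return out

levels = parse_levels("[13039/4, 21/10, 206/100, 831, 0, 0, 0, 0, 0]")
print(levels[0])  # (13039, 4): the first level holds ~3000x its target
```

That backlog in the first level is consistent with Mark's diagnosis: compaction cannot keep up, and reads must consult huge numbers of uncompacted sstables.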
Re: My cluster shows high system load without any apparent reason
Hi Garo,

I haven't had this issue on SSDs, but I have definitely seen it with spinning drives. I would think that SSDs would have more than enough bandwidth to keep up with requests, but you may be running into issues with Cassandra calling fsync on the commitlog. What are your settings for the following?

commitlog_sync
commitlog_sync_period_in_ms
commitlog_sync_batch_window_in_ms

If you're using periodic, you could try changing commitlog_sync_period_in_ms to something smaller, like 1000 ms, and see if the problem is reduced (the theory being that there would be less pending data to sync). If you are using batch, switch to periodic.

You could also try mounting a GP2 volume, putting the commit log directory on it, and seeing if the problem goes away (say 200 GB, for sufficient IOPS). I'm guessing you don't have much in the way of unallocated blocks in your LVM vg.

Writing to the commit log is single-threaded, and if the commit log is tied up waiting for IO during an fsync, it will block writes to the node. If the write threads are blocked, the node will also stall for reads. The symptoms you are seeing are exactly what I saw with spinning rust. I'm not sure why you didn't see this problem with EBS.

-Mark

On Sat, Jul 23, 2016 at 7:21 AM, Juho Mäkinen <juho.maki...@gmail.com> wrote:
> Hi Mark.
>
> I have an LVM volume which stripes the four ephemeral SSDs in the system,
> and we use that for both data and commit log. I've used a similar setup in
> the past (but with EBS) and we didn't see this behavior. Each node gets just
> around 250 writes per second. It is possible that the commit log is the
> issue here, but could I somehow measure it from the JMX metrics, without
> needing to restructure my entire cluster?
>
> Here's a screenshot of the latencies from our application's point of view,
> which uses the Cassandra cluster to do reads. I started a rolling restart at
> around 09:30 and you can clearly see how the system latency dropped.
> http://imgur.com/a/kaPG7
>
> On Sat, Jul 23, 2016 at 2:25 AM, Mark Rose <markr...@markrose.ca> wrote:
>>
>> Hi Garo,
>>
>> Did you put the commit log on its own drive? Spiking CPU during stalls
>> is a symptom of not doing that. The commitlog is very latency
>> sensitive, even under low load. Do be sure you're using the deadline
>> or noop scheduler for that reason, too.
>>
>> -Mark
>>
>> On Fri, Jul 22, 2016 at 4:44 PM, Juho Mäkinen <juho.maki...@gmail.com>
>> wrote:
>>>> Are you using XFS or Ext4 for data?
>>>
>>> We are using XFS. Many nodes have a couple of large SSTables (on the
>>> order of 20-50 GiB), but I haven't cross-checked whether the load spikes
>>> happen only on machines which have these tables.
>>>
>>>> As an aside, for the amount of reads/writes you're doing, I've found
>>>> using c3/m3 instances with the commit log on the ephemeral storage and
>>>> data on st1 EBS volumes to be much more cost effective. It's something
>>>> to look into if you haven't already.
>>>
>>> Thanks for the idea! I previously used c4.4xlarge instances with two
>>> 1500 GB GP2 volumes, but I found out that we maxed out their bandwidth too
>>> easily, so that's why my newest cluster is based on i2.4xlarge instances.
>>>
>>> And to answer Ryan: no, we are not using counters.
>>>
>>> I was thinking that the large amount (100+ GiB) of mmap'ed files could
>>> somehow cause some inefficiencies on the kernel side. That's why I started
>>> to learn about kernel huge pages and came up with the idea of disabling
>>> huge page defrag, but nothing I've found indicates that this can be a real
>>> problem. After all, the Linux fs cache is a really old feature, so I expect
>>> it to be pretty bug free.
>>>
>>> I guess that I next have to learn how the load value itself is
>>> calculated. I know the basic idea that when load is below the number of
>>> CPUs the system should still be fine, but there's at least iowait which is
>>> also used to calculate the load. Because I am not seeing any extensive
>>> iowait, and my userland CPU usage is well below what my 16 cores should
>>> handle, what else contributes to the system load? Can I somehow make an
>>> educated guess about what the high load might tell me, if it's not iowait
>>> and it's not purely userland process CPU usage?
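On Garo's closing question: besides runnable tasks, the Linux load average also counts tasks in uninterruptible sleep (state D, typically blocked in the kernel on IO or certain locks). That is how a box can show load 20-30 with low user CPU and little iowait. A toy sketch (deliberately simplified /proc parsing, synthetic data; real comm fields can contain spaces and need more careful handling):

```python
# Linux load average counts runnable tasks (state R) *and* tasks in
# uninterruptible sleep (state D). Many threads stuck in D state --
# e.g. blocked in the kernel -- inflate load without burning CPU.
def load_contributors(stat_lines):
    """stat_lines: /proc/<pid>/stat-style lines; the 3rd field is state.
    (Simplified: assumes the comm field contains no spaces.)"""
    counts = {"R": 0, "D": 0}
    for line in stat_lines:
        state = line.split()[2]
        if state in counts:
            counts[state] += 1
    return counts

# Synthetic sample: two runnable threads, three blocked in the kernel.
sample = [
    "101 (java) R 1 ...",
    "102 (java) R 1 ...",
    "103 (java) D 1 ...",
    "104 (java) D 1 ...",
    "105 (java) D 1 ...",
    "106 (java) S 1 ...",  # ordinary sleep: does not count toward load
]
print(load_contributors(sample))  # {'R': 2, 'D': 3}
```

So a useful next debugging step in Garo's situation is to sample which threads are in D state during a spike (e.g. via /proc or ps) rather than looking only at CPU and iowait.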
Re: My cluster shows high system load without any apparent reason
Hi Garo,

Did you put the commit log on its own drive? Spiking CPU during stalls is a symptom of not doing that. The commitlog is very latency sensitive, even under low load. Do be sure you're using the deadline or noop scheduler for that reason, too.

-Mark

On Fri, Jul 22, 2016 at 4:44 PM, Juho Mäkinen wrote:
>> Are you using XFS or Ext4 for data?
>
> We are using XFS. Many nodes have a couple of large SSTables (on the order
> of 20-50 GiB), but I haven't cross-checked whether the load spikes happen
> only on machines which have these tables.
>
>> As an aside, for the amount of reads/writes you're doing, I've found
>> using c3/m3 instances with the commit log on the ephemeral storage and
>> data on st1 EBS volumes to be much more cost effective. It's something
>> to look into if you haven't already.
>
> Thanks for the idea! I previously used c4.4xlarge instances with two
> 1500 GB GP2 volumes, but I found out that we maxed out their bandwidth too
> easily, so that's why my newest cluster is based on i2.4xlarge instances.
>
> And to answer Ryan: no, we are not using counters.
>
> I was thinking that the large amount (100+ GiB) of mmap'ed files could
> somehow cause some inefficiencies on the kernel side. That's why I started
> to learn about kernel huge pages and came up with the idea of disabling
> huge page defrag, but nothing I've found indicates that this can be a real
> problem. After all, the Linux fs cache is a really old feature, so I expect
> it to be pretty bug free.
>
> I guess that I next have to learn how the load value itself is calculated.
> I know the basic idea that when load is below the number of CPUs the system
> should still be fine, but there's at least iowait which is also used to
> calculate the load. Because I am not seeing any extensive iowait, and my
> userland CPU usage is well below what my 16 cores should handle, what else
> contributes to the system load? Can I somehow make an educated guess about
> what the high load might tell me, if it's not iowait and it's not purely
> userland process CPU usage? This is starting to get really deep really
> fast :/
>
> - Garo
>
>>
>> -Mark
>>
>> On Fri, Jul 22, 2016 at 8:10 AM, Juho Mäkinen wrote:
>> > After a few days I've also tried disabling Linux kernel huge page
>> > defrag (echo never > /sys/kernel/mm/transparent_hugepage/defrag) and
>> > turning coalescing off (otc_coalescing_strategy: DISABLED), but neither
>> > did any good. I'm using LCS, there are no big GC pauses, and I have set
>> > "concurrent_compactors: 5" (machines have 16 CPUs), but there are
>> > usually no compactions running when the load spike comes. "nodetool
>> > tpstats" shows no running thread pools except Native-Transport-Requests
>> > (usually 0-4) and perhaps ReadStage (usually 0-1).
>> >
>> > The symptoms are the same: after about 12-24 hours an increasing number
>> > of nodes start to show short CPU load spikes, and this affects the
>> > median read latencies. I ran dstat when a load spike was already under
>> > way (see screenshot http://i.imgur.com/B0S5Zki.png), but no column other
>> > than the load itself shows any major change except the system/kernel CPU
>> > usage.
>> >
>> > All further ideas on how to debug this are greatly appreciated.
>> >
>> > On Wed, Jul 20, 2016 at 7:13 PM, Juho Mäkinen wrote:
>> >>
>> >> I just recently upgraded our cluster to 2.2.7, and after putting the
>> >> cluster under production load the instances started to show high load
>> >> (as shown by uptime) without any apparent reason, and I'm not quite
>> >> sure what could be causing it.
>> >>
>> >> We are running on i2.4xlarge, so we have 16 cores, 120GB of ram, and
>> >> four 800GB SSDs (set up as an lvm stripe into one big lvol). Running
>> >> 3.13.0-87-generic on HVM virtualisation. The cluster has 26 TiB of data
>> >> stored in two tables.
>> >>
>> >> Symptoms:
>> >> - High load, sometimes up to 30 for a short duration of a few minutes;
>> >> then the load drops back to the cluster average: 3-4.
>> >> - Instances might have one compaction running, but might not have any
>> >> compactions.
>> >> - Each node is serving around 250-300 reads per second and around 200
>> >> writes per second.
>> >> - Restarting a node fixes the problem for around 18-24 hours.
>> >> - No or very little IO-wait.
>> >> - top shows that around 3-10 threads are running on high cpu, but that
>> >> alone should not cause a load of 20-30.
>> >> - Doesn't seem to be GC load: a system starts to show symptoms when it
>> >> has run only one CMS sweep. It's not like it's doing constant
>> >> stop-the-world gc's.
>> >> - top shows that the C* process uses 100G of RSS memory. I assume that
>> >> this is because cassandra opens all SSTables with mmap(), so they show
>> >> up in the RSS count.
Re: My cluster shows high system load without any apparent reason
Hi Garo,

Are you using XFS or Ext4 for data? XFS is much better at deleting large files, such as may happen after a compaction. If you have 26 TB in just two tables, I bet you have some massive sstables, which may take a while for Ext4 to delete and may be causing the stalls. The underlying block layers will not show high IO-wait. See if the stall times line up with large compactions in system.log. If you must use Ext4, another way to avoid issues with massive sstables is to run more, smaller instances.

As an aside, for the amount of reads/writes you're doing, I've found using c3/m3 instances with the commit log on the ephemeral storage and data on st1 EBS volumes to be much more cost effective. It's something to look into if you haven't already.

-Mark

On Fri, Jul 22, 2016 at 8:10 AM, Juho Mäkinen wrote:
> After a few days I've also tried disabling Linux kernel huge page defrag
> (echo never > /sys/kernel/mm/transparent_hugepage/defrag) and turning
> coalescing off (otc_coalescing_strategy: DISABLED), but neither did any
> good. I'm using LCS, there are no big GC pauses, and I have set
> "concurrent_compactors: 5" (machines have 16 CPUs), but there are usually
> no compactions running when the load spike comes. "nodetool tpstats" shows
> no running thread pools except Native-Transport-Requests (usually 0-4) and
> perhaps ReadStage (usually 0-1).
>
> The symptoms are the same: after about 12-24 hours an increasing number of
> nodes start to show short CPU load spikes, and this affects the median read
> latencies. I ran dstat when a load spike was already under way (see
> screenshot http://i.imgur.com/B0S5Zki.png), but no column other than the
> load itself shows any major change except the system/kernel CPU usage.
>
> All further ideas on how to debug this are greatly appreciated.
>
> On Wed, Jul 20, 2016 at 7:13 PM, Juho Mäkinen wrote:
>>
>> I just recently upgraded our cluster to 2.2.7, and after putting the
>> cluster under production load the instances started to show high load (as
>> shown by uptime) without any apparent reason, and I'm not quite sure what
>> could be causing it.
>>
>> We are running on i2.4xlarge, so we have 16 cores, 120GB of ram, and four
>> 800GB SSDs (set up as an lvm stripe into one big lvol). Running
>> 3.13.0-87-generic on HVM virtualisation. The cluster has 26 TiB of data
>> stored in two tables.
>>
>> Symptoms:
>> - High load, sometimes up to 30 for a short duration of a few minutes;
>> then the load drops back to the cluster average: 3-4.
>> - Instances might have one compaction running, but might not have any
>> compactions.
>> - Each node is serving around 250-300 reads per second and around 200
>> writes per second.
>> - Restarting a node fixes the problem for around 18-24 hours.
>> - No or very little IO-wait.
>> - top shows that around 3-10 threads are running on high cpu, but that
>> alone should not cause a load of 20-30.
>> - Doesn't seem to be GC load: a system starts to show symptoms when it
>> has run only one CMS sweep. It's not like it's doing constant
>> stop-the-world gc's.
>> - top shows that the C* process uses 100G of RSS memory. I assume that
>> this is because cassandra opens all SSTables with mmap(), so they show up
>> in the RSS count.
>>
>> What I've done so far:
>> - Rolling restart. Helped for about one day.
>> - Tried doing a manual GC on the cluster.
>> - Increased heap from 8 GiB with CMS to 16 GiB with G1GC.
>> - sjk-plus shows a bunch of SharedPool workers. Not sure what to make of
>> this.
>> - Browsed over
>> https://tobert.github.io/pages/als-cassandra-21-tuning-guide.html but
>> didn't find anything apparent.
>>
>> I know that the general symptom of "system shows high load" is not very
>> good or informative, but I don't know how to better describe what's going
>> on. I appreciate all ideas on what to try and how to debug this further.
>>
>> - Garo