Re: Question on Monitoring a Mesos Cluster
The master/cpus_percent metric is nothing more than used / total. Note, however, that it represents resources allocated to tasks; tasks may not use them fully (or may use more, if isolation is not enabled). You can't get actual cluster utilisation from it. The best option is to aggregate the system/* metrics, which report node load. This, however, includes all the processes running on a node, not only Mesos and its tasks.

Hope this helps.

On Mon, Mar 9, 2015 at 8:16 AM, Andras Kerekes andras.kere...@ishisystems.com wrote:

We use the same monitoring script from rayrod2030. However, instead of master_cpus_percent, we use master_cpus_used and master_cpus_total to calculate a percentage. This gives the allocated percentage of CPUs in the cluster; the actual utilization is measured by collectd.

-----Original Message-----
From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick Davies
Sent: Saturday, March 07, 2015 2:15 PM
To: user@mesos.apache.org
Subject: Re: Question on Monitoring a Mesos Cluster

... snip ...
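For reference, the used/total arithmetic described above is easy to check by hand against the elected master's /metrics/snapshot endpoint. A minimal sketch (assuming the master is reachable on localhost:5050; the master/*_used and master/*_total keys are standard Mesos metrics, but treat this as an untested illustration, not the collectd plugin itself):

    # Print allocation percentages straight from the master's snapshot.
    # Remember: this is ALLOCATION (resources promised to tasks), not
    # actual utilization, as discussed above.
    import json
    import urllib2

    snapshot = json.load(urllib2.urlopen('http://localhost:5050/metrics/snapshot'))
    for resource in ('cpus', 'mem', 'disk'):
        used = snapshot['master/%s_used' % resource]
        total = snapshot['master/%s_total' % resource]
        pct = 100.0 * used / total if total else 0.0
        print '%s allocated: %.1f%%' % (resource, pct)

    # For actual utilization you'd instead aggregate the system/*
    # metrics from each slave's /metrics/snapshot (system/load_1min,
    # system/cpus_total, system/mem_free_bytes, ...), per the note above.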
Question on Monitoring a Mesos Cluster
I wrote a python collectd plugin which pulls both master stats (only if master/elected == 1) and slave stats from the REST API, under /metrics/snapshot and /slave(1)/stats.json respectively, and throws those into graphite.

After getting everything working, I built a few dashboards, one of which displays these stats from http://master:5050/metrics/snapshot:

master/disk_percent
master/cpus_percent
master/mem_percent

I had assumed that this was something like aggregate cluster utilization, but this seems incorrect in practice. I have a small cluster with ~1T of memory, ~25T of disk, and ~540 CPU cores. I had a dozen or so small tasks running, and launched 500 tasks with 1G of memory and 1 CPU each. Now I'd expect to see the disk/cpu/mem percentage metrics above go up considerably. I did notice that cpus_percent went to around 0.94.

What is the correct way to measure overall cluster utilization for capacity planning? We can have the NOC watch this and simply add more hardware when the number starts getting low.

Thanks

--
Jeff Schroeder
Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com
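The master/elected gating mentioned above is the part worth spelling out: only the active master has the cluster-wide numbers, so standby masters should be skipped. A rough sketch of the shape of that check (hypothetical and simplified to plain prints; a real plugin would report through collectd's python plugin API rather than printing):

    import json
    import urllib2

    def poll_master(host='localhost', port=5050):
        url = 'http://%s:%d/metrics/snapshot' % (host, port)
        snapshot = json.load(urllib2.urlopen(url))
        # Only the elected master carries meaningful cluster-wide stats,
        # so standby masters are skipped entirely.
        if snapshot.get('master/elected') != 1:
            return
        for key in sorted(snapshot):
            # '/' is not valid in collectd metric names, which is why the
            # graphite series end up named gauge-master_cpus_used etc.
            print 'gauge-%s' % key.replace('/', '_'), snapshot[key]

    poll_master()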
Re: Question on Monitoring a Mesos Cluster
Yeah, that confused me too - I think that figure is specific to the master/slave polled (and that'll just be the active one, since you're only reporting when master/elected is true).

I'm using this one: https://github.com/rayrod2030/collectd-mesos , not sure if that's the same as yours?

On 7 March 2015 at 18:56, Jeff Schroeder jeffschroe...@computer.org wrote:

... snip ...
Re: Question on Monitoring a Mesos Cluster
Responses inline.

On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote:

> ... snip ...
>
> > After getting everything working, I built a few dashboards, one of which
> > displays these stats from http://master:5050/metrics/snapshot:
> >
> > master/disk_percent
> > master/cpus_percent
> > master/mem_percent
> >
> > I had assumed that this was something like aggregate cluster utilization,
> > but this seems incorrect in practice. I have a small cluster with ~1T of
> > memory, ~25T of disk, and ~540 CPU cores. I had a dozen or so small tasks
> > running, and launched 500 tasks with 1G of memory and 1 CPU each. Now I'd
> > expect to see the disk/cpu/mem percentage metrics above go up considerably.
> > I did notice that cpus_percent went to around 0.94. What is the correct way
> > to measure overall cluster utilization for capacity planning? We can have
> > the NOC watch this and simply add more hardware when the number starts
> > getting low.
>
> Boy, I cannot wait to read the tidbits of wisdom here. Maybe the development
> group has more accurate information, if not some vague roadmap, on
> resource/process monitoring. Sooner or later this is going to become a
> quintessential need, so I hope the deep thinkers are all over it in both the
> user and dev groups. In fact, the monitoring can easily create a significant
> load on the cluster/cloud if one is not judicious in how it is architected,
> implemented, and dynamically tuned.

Monitoring via passive metrics gathering and application telemetry is one of the best ways to do it. That is how I've implemented things. The beauty of the REST API is that it isn't heavyweight, and every master has it on port 5050 (by default) and every slave has it on port 5051 (by default).

Since I'm throwing this all into graphite (well, technically cassandra fronted by cyanite fronted by graphite-api... but same difference), I found a reasonable way to do capacity planning. Collectd will poll the master/slave on each mesos host every 10 seconds (localhost:5050 on masters and localhost:5051 on slaves). This gets put into graphite via collectd's write_graphite plugin.

These 3 graphite targets give me percentages of utilization for nice graphs:

alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used, collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used, collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used, collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")

With that data, you can have your monitoring tools such as nagios/icinga poll graphite. Using the native graphite render API, you can do things like:

* if the cpu usage is over 80% for 24 hours, send a warning event
* if the cpu usage is over 95% for 6 hours, send a critical event

This allows mostly no-impact monitoring, since the monitoring tools are hitting graphite.

Anyways, back to the original questions: How does everyone do proper monitoring and capacity planning for large mesos clusters? I expect my cluster to grow beyond what it currently is by quite a bit.

--
Jeff Schroeder
Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com
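A rough sketch of the render-API check described above, for the "warn at 80% over 24 hours" case (GRAPHITE and the target string are assumptions for illustration; point them at your own graphite-api endpoint and series names):

    import json
    import urllib
    import urllib2

    # Assumed endpoint and target; substitute your own cluster name.
    GRAPHITE = 'http://graphite.example.com'
    TARGET = ('asPercent(collectd.mesos.clustername.gauge-master_cpu_used,'
              'collectd.mesos.clustername.gauge-master_cpu_total)')

    def over_for_window(threshold, window):
        # format=json yields [{"target": ..., "datapoints": [[value, ts], ...]}]
        qs = urllib.urlencode({'target': TARGET, 'from': window, 'format': 'json'})
        series = json.load(urllib2.urlopen('%s/render?%s' % (GRAPHITE, qs)))
        points = [v for v, _ in series[0]['datapoints'] if v is not None]
        # Over the threshold for the WHOLE window means even the minimum
        # sample exceeded it.
        return bool(points) and min(points) > threshold

    if over_for_window(95, '-6hours'):
        print 'CRITICAL: cluster CPU allocation > 95% for 6 hours'
    elif over_for_window(80, '-24hours'):
        print 'WARNING: cluster CPU allocation > 80% for 24 hours'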