Re: Question on Monitoring a Mesos Cluster

2015-03-11 Thread Alex Rukletsov
The master/cpus_percent metric is simply used / total. It represents
resources allocated to tasks, but tasks may not use them fully (or may
use more, if isolation is not enabled). You can't get actual cluster
utilisation from it; the best option is to aggregate the system/*
metrics, which report each node's load. Note that these include every
process running on a node, not only Mesos and its tasks. Hope this
helps.
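
For illustration, a minimal sketch of both readings (hostnames are
hypothetical, and the metric keys assume a stock /metrics/snapshot
endpoint; they can vary by Mesos version):

import json
import urllib2

def allocation_percent(master="http://master.example.com:5050"):
    # master/cpus_percent is simply cpus_used / cpus_total, i.e. the
    # share of CPUs *allocated* to tasks, not what tasks actually use.
    snap = json.load(urllib2.urlopen(master + "/metrics/snapshot"))
    total = snap["master/cpus_total"]
    return 100.0 * snap["master/cpus_used"] / total if total else 0.0

def node_load(slave="http://slave.example.com:5051"):
    # system/* metrics report host-level load, which includes every
    # process on the node, not only Mesos and its tasks.
    snap = json.load(urllib2.urlopen(slave + "/metrics/snapshot"))
    return snap["system/load_1min"] / snap["system/cpus_total"]

Aggregating node_load() across all slaves is what approximates actual
utilisation; allocation_percent() only tells you how much of the pool
the master has promised away.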


On Mon, Mar 9, 2015 at 8:16 AM, Andras Kerekes 
andras.kere...@ishisystems.com wrote:

 We use the same monitoring script from rayrod2030. However, instead of
 master_cpus_percent, we use master_cpus_used and master_cpus_total to
 calculate a percentage. This gives the allocated percentage of CPUs in
 the cluster; the actual utilization is measured by collectd.

 -----Original Message-----
 From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick
 Davies
 Sent: Saturday, March 07, 2015 2:15 PM
 To: user@mesos.apache.org
 Subject: Re: Question on Monitoring a Mesos Cluster

 Yeah, that confused me too - I think that figure is specific to the
 master/slave polled (and that'll just be the active one, since you're
 only reporting when master/elected is true).

 I'm using this one: https://github.com/rayrod2030/collectd-mesos - not
 sure if that's the same as yours?


 On 7 March 2015 at 18:56, Jeff Schroeder jeffschroe...@computer.org
 wrote:
  Responses inline
 
  On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote:
 
  ... snip ...
 
  After getting everything working, I built a few dashboards, one of
  which displays these stats from http://master:5050/metrics/snapshot:
 
  master/disk_percent
  master/cpus_percent
  master/mem_percent
 
  I had assumed that this was something like aggregate cluster
  utilization, but this seems incorrect in practice. I have a small
  cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had
  a dozen or so small tasks running, and launched 500 tasks with 1G of
  memory and 1 CPU each.
 
  Now I'd expect to see the disk/cpu/mem percentage metrics above go up
  considerably. I did notice that cpus_percent went to around 0.94.
 
  What is the correct way to measure overall cluster utilization for
  capacity planning? We can have the NOC watch this and simply add
  more hardware when the remaining headroom starts getting low.
 
 
  Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
  development group has more accurate information, if not some vague
  roadmap on resource/process monitoring. Sooner or later, this is
  going to become a quintessential need, so I hope the deep thinkers
  are all over it in both the user and dev groups.

  In fact, monitoring can easily create significant load on the
  cluster/cloud if one is not judicious in how it is architected,
  implemented, and dynamically tuned.
 
 
 
 
  Monitoring via passive metrics gathering and application telemetry
  is one of the best ways to do it. That is how I've implemented things.
 
 
 
  The beauty of the REST API is that it isn't heavyweight, and every
  master has it on port 5050 (by default) and every slave has it on
  port 5051 (by default). Since I'm throwing this all into graphite
  (well, technically cassandra fronted by cyanite fronted by
  graphite-api... but same difference), I found a reasonable way to do
  capacity planning. Collectd will poll the master/slave on each mesos
  host every 10 seconds (localhost:5050 on masters and localhost:5051
  on slaves). This gets put into graphite via collectd's write_graphite
  plugin. These 3 graphite targets give me percentages of utilization
  for nice graphs:
 
  alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used, collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
  alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used, collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
  alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used, collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")
 
  With that data, you can have your monitoring tools, such as
  nagios/icinga, poll graphite. Using the native graphite render API,
  you can do things like:

  * if the cpu usage is over 80% for 24 hours, send a warning event
  * if the cpu usage is over 95% for 6 hours, send a critical event

  This allows mostly no-impact monitoring, since the monitoring tools
  are hitting graphite rather than the cluster itself.
 
  Anyways, back to the original questions:
 
  How does everyone do proper monitoring and capacity planning for large
  mesos clusters? I expect my cluster to grow beyond what it currently
  is by quite a bit.
 
  --
  Jeff Schroeder
 
  Don't drink and derive, alcohol and analysis don't mix.
  http://www.digitalprognosis.com



Question on Monitoring a Mesos Cluster

2015-03-07 Thread Jeff Schroeder
I wrote a python collectd plugin which pulls both master stats (only if
master/elected == 1) and slave stats from the REST API, under
/metrics/snapshot and /slave(1)/stats.json respectively, and throws
them into graphite.
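
For reference, a minimal sketch of what such a collectd read plugin can
look like (Python 2, since collectd's embedded interpreter was Python 2
at the time; the URL and the metric-name munging are illustrative
choices, not the exact plugin):

import json
import urllib2

import collectd  # provided by collectd's python plugin

SNAPSHOT_URL = "http://localhost:5050/metrics/snapshot"

def read_callback():
    snapshot = json.load(urllib2.urlopen(SNAPSHOT_URL))
    # Only report master stats from the currently elected master.
    if snapshot.get("master/elected") != 1:
        return
    for name, value in snapshot.items():
        metric = collectd.Values(plugin="mesos", type="gauge")
        metric.type_instance = name.replace("/", "_")
        metric.dispatch(values=[value])

collectd.register_read(read_callback)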

After getting everything working, I built a few dashboards, one of which
displays these stats from http://master:5050/metrics/snapshot:

master/disk_percent
master/cpus_percent
master/mem_percent

I had assumed that this was something like aggregate cluster utilization,
but this seems incorrect in practice. I have a small cluster with ~1T of
memory, ~25T of Disks, and ~540 CPU cores. I had a dozen or so small tasks
running, and launched 500 tasks with 1G of memory and 1 CPU each.

Now I'd expect to see the disk/cpu/mem percentage metrics above go up
considerably. I did notice that cpus_percent went to around 0.94.

What is the correct way to measure overall cluster utilization for capacity
planning? We can have the NOC watch this and simply add more hardware when
the remaining headroom starts getting low.

Thanks

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com


Re: Question on Monitoring a Mesos Cluster

2015-03-07 Thread Dick Davies
Yeah, that confused me too - I think that figure is specific to the
master/slave polled (and that'll just be the active one, since you're
only reporting when master/elected is true).

I'm using this one: https://github.com/rayrod2030/collectd-mesos - not
sure if that's the same as yours?


On 7 March 2015 at 18:56, Jeff Schroeder jeffschroe...@computer.org wrote:
 Responses inline

 On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote:

 ... snip ...

 After getting everything working, I built a few dashboards, one of which
 displays these stats from http://master:5050/metrics/snapshot:

 master/disk_percent
 master/cpus_percent
 master/mem_percent

 I had assumed that this was something like aggregate cluster
 utilization, but this seems incorrect in practice. I have a small
 cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had a
 dozen or so small tasks running, and launched 500 tasks with 1G of
 memory and 1 CPU each.

 Now I'd expect to see the disk/cpu/mem percentage metrics above go up
 considerably. I did notice that cpus_percent went to around 0.94.

 What is the correct way to measure overall cluster utilization for
 capacity planning? We can have the NOC watch this and simply add more
 hardware when the remaining headroom starts getting low.


 Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
 development group has more accurate information, if not some vague roadmap
 on resource/process monitoring. Sooner or later, this is going to become a
 quintessential need, so I hope the deep thinkers are all over it in both
 the user and dev groups.

 In fact, monitoring can easily create significant load on the
 cluster/cloud if one is not judicious in how it is architected,
 implemented, and dynamically tuned.




 Monitoring via passive metrics gathering and application telemetry is one
 of the best ways to do it. That is how I've implemented things.



 The beauty of the REST API is that it isn't heavyweight, and every master
 has it on port 5050 (by default) and every slave has it on port 5051 (by
 default). Since I'm throwing this all into graphite (well, technically
 cassandra fronted by cyanite fronted by graphite-api... but same
 difference), I found a reasonable way to do capacity planning. Collectd
 will poll the master/slave on each mesos host every 10 seconds
 (localhost:5050 on masters and localhost:5051 on slaves). This gets put
 into graphite via collectd's write_graphite plugin. These 3 graphite
 targets give me percentages of utilization for nice graphs:

 alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used, collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
 alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used, collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
 alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used, collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")

 With that data, you can have your monitoring tools, such as nagios/icinga,
 poll graphite. Using the native graphite render API, you can do things like:

 * if the cpu usage is over 80% for 24 hours, send a warning event
 * if the cpu usage is over 95% for 6 hours, send a critical event

 This allows mostly no-impact monitoring, since the monitoring tools are
 hitting graphite rather than the cluster itself.

 Anyways, back to the original questions:

 How does everyone do proper monitoring and capacity planning for large mesos
 clusters? I expect my cluster to grow beyond what it currently is by quite a
 bit.

 --
 Jeff Schroeder

 Don't drink and derive, alcohol and analysis don't mix.
 http://www.digitalprognosis.com


Re: Question on Monitoring a Mesos Cluster

2015-03-07 Thread Jeff Schroeder
Responses inline

On Sat, Mar 7, 2015 at 12:48 PM, CCAAT cc...@tampabay.rr.com wrote:

 ... snip ...

 After getting everything working, I built a few dashboards, one of which
 displays these stats from http://master:5050/metrics/snapshot:

 master/disk_percent
 master/cpus_percent
 master/mem_percent

 I had assumed that this was something like aggregate cluster
 utilization, but this seems incorrect in practice. I have a small
 cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had a
 dozen or so small tasks running, and launched 500 tasks with 1G of
 memory and 1 CPU each.

 Now I'd expect to see the disk/cpu/mem percentage metrics above go up
 considerably. I did notice that cpus_percent went to around 0.94.

 What is the correct way to measure overall cluster utilization for
 capacity planning? We can have the NOC watch this and simply add more
 hardware when the remaining headroom starts getting low.


 Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
 development group has more accurate information, if not some vague roadmap
 on resource/process monitoring. Sooner or later, this is going to become a
 quintessential need, so I hope the deep thinkers are all over it in both
 the user and dev groups.

 In fact, monitoring can easily create significant load on the
 cluster/cloud if one is not judicious in how it is architected,
 implemented, and dynamically tuned.




Monitoring via passive metrics gathering and application telemetry is one
of the best ways to do it. That is how I've implemented things.



The beauty of the REST API is that it isn't heavyweight, and every master
has it on port 5050 (by default) and every slave has it on port 5051 (by
default). Since I'm throwing this all into graphite (well, technically
cassandra fronted by cyanite fronted by graphite-api... but same
difference), I found a reasonable way to do capacity planning. Collectd
will poll the master/slave on each mesos host every 10 seconds
(localhost:5050 on masters and localhost:5051 on slaves). This gets put
into graphite via collectd's write_graphite plugin. These 3 graphite
targets give me percentages of utilization for nice graphs:

alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used, collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used, collectd.mesos.clustername.gauge-master_mem_total), "Total Memory Usage")
alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used, collectd.mesos.clustername.gauge-master_disk_total), "Total Disk Usage")

With that data, you can have your monitoring tools, such as nagios/icinga,
poll graphite. Using the native graphite render API, you can do things like:

* if the cpu usage is over 80% for 24 hours, send a warning event
* if the cpu usage is over 95% for 6 hours, send a critical event

This allows mostly no-impact monitoring, since the monitoring tools are
hitting graphite rather than the cluster itself.
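
A sketch of such a check (the graphite host is hypothetical, and in
practice the target would be the URL-encoded asPercent() expression
above, wired into a nagios/icinga plugin rather than a bare script):

import json
import urllib2

GRAPHITE = "http://graphite.example.com"  # hypothetical host
TARGET = "collectd.mesos.clustername.gauge-master_cpu_used"

def mean_over(target, window):
    # Graphite's render API returns [{"target": ..., "datapoints":
    # [[value, timestamp], ...]}] when asked for format=json.
    url = "%s/render?target=%s&from=-%s&format=json" % (GRAPHITE, target, window)
    datapoints = json.load(urllib2.urlopen(url))[0]["datapoints"]
    values = [v for v, _ts in datapoints if v is not None]
    return sum(values) / len(values) if values else 0.0

# e.g. critical if over 95% for 6 hours, warn if over 80% for 24 hours
if mean_over(TARGET, "6hours") > 95:
    print("CRITICAL: cluster CPU usage over 95% for 6 hours")
elif mean_over(TARGET, "24hours") > 80:
    print("WARNING: cluster CPU usage over 80% for 24 hours")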

Anyways, back to the original questions:

How does everyone do proper monitoring and capacity planning for large
mesos clusters? I expect my cluster to grow beyond what it currently is by
quite a bit.

-- 
Jeff Schroeder

Don't drink and derive, alcohol and analysis don't mix.
http://www.digitalprognosis.com