The master/cpus_percent metric is simply used / total. It represents
resources allocated to tasks, but tasks may not use them fully (or may
use more, if isolation is not enabled). You can't get actual cluster
utilisation from it; the best option is to aggregate the system/*
metrics, which report node load. These, however, include all the
processes running on a node, not only mesos and its tasks. Hope this
helps.
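To make the distinction concrete, here is a minimal sketch of computing the allocation percentage described above (used / total) from a master's /metrics/snapshot endpoint. The host/port default and the urllib-based fetch are assumptions for illustration, not something prescribed in this thread:

```python
# Sketch: "allocation percent" as discussed above is used / total
# from the master's /metrics/snapshot JSON -- it measures what is
# allocated to tasks, not what tasks actually consume.
import json
from urllib.request import urlopen


def allocation_percent(snapshot, resource="cpus"):
    """Return the allocated (not actually consumed) percent for a resource."""
    used = snapshot[f"master/{resource}_used"]
    total = snapshot[f"master/{resource}_total"]
    return 100.0 * used / total if total else 0.0


def fetch_snapshot(host="localhost", port=5050):
    # Masters serve /metrics/snapshot on port 5050 by default (assumed here).
    with urlopen(f"http://{host}:{port}/metrics/snapshot") as resp:
        return json.load(resp)
```

For example, with 270 of 540 CPUs allocated, allocation_percent returns 50.0, regardless of how busy those CPUs actually are.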


On Mon, Mar 9, 2015 at 8:16 AM, Andras Kerekes <
andras.kere...@ishisystems.com> wrote:

> We use the same monitoring script from rayrod2030. However, instead of
> master_cpus_percent, we use master_cpus_used and master_cpus_total to
> calculate a percentage. This gives the allocated percentage of CPUs in
> the cluster; the actual utilization is measured by collectd.
>
> -----Original Message-----
> From: rasput...@gmail.com [mailto:rasput...@gmail.com] On Behalf Of Dick
> Davies
> Sent: Saturday, March 07, 2015 2:15 PM
> To: user@mesos.apache.org
> Subject: Re: Question on Monitoring a Mesos Cluster
>
> Yeah, that confused me too - I think that figure is specific to the
> master/slave polled (and that'll just be the active one, since you're only
> reporting when master/elected is true).
>
> I'm using this one: https://github.com/rayrod2030/collectd-mesos - not
> sure if that's the same as yours?
>
>
> On 7 March 2015 at 18:56, Jeff Schroeder <jeffschroe...@computer.org>
> wrote:
> > Responses inline
> >
> > On Sat, Mar 7, 2015 at 12:48 PM, CCAAT <cc...@tampabay.rr.com> wrote:
> >>
> >> ... snip ...
> >>>
> >>> After getting everything working, I built a few dashboards, one of
> >>> which displays these stats from http://master:5050/metrics/snapshot:
> >>>
> >>> master/disk_percent
> >>> master/cpus_percent
> >>> master/mem_percent
> >>>
> >>> I had assumed that this was something like aggregate cluster
> >>> utilization, but this seems incorrect in practice. I have a small
> >>> cluster with ~1T of memory, ~25T of Disks, and ~540 CPU cores. I had
> >>> a dozen or so small tasks running, and launched 500 tasks with 1G of
> >>> memory and 1 CPU each.
> >>>
> >>> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
> >>> considerably. I did notice that cpus_percent went to around 0.94.
> >>>
> >>> What is the correct way to measure overall cluster utilization for
> >>> capacity planning? We can have the NOC watch this and simply add
> >>> more hardware when the number starts getting low.
> >>
> >>
> >> Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
> >> development group has more accurate information, if not some vague
> >> roadmap, on resource/process monitoring. Sooner or later this is
> >> going to become a quintessential need, so I hope the "deep thinkers"
> >> are all over it in both the user and dev groups.
> >>
> >> In fact, the monitoring itself can easily create a significant load
> >> on the cluster/cloud if one is not judicious in how it is
> >> architected, implemented, and dynamically tuned.
> >
> >
> >
> >
> > Monitoring via passive metrics gathering and application "telemetry"
> > is one of the best ways to do it. That is how I've implemented things.
> >
> >
> >
> > The beauty of the REST API is that it isn't heavyweight, and every
> > master has it on port 5050 (by default) and every slave has it on port
> > 5051 (by default). Since I'm throwing this all into graphite (well,
> > technically cassandra fronted by cyanite fronted by graphite-api...
> > but same difference), I found a reasonable way to do capacity
> > planning. Collectd polls the master/slave on each mesos host every
> > 10 seconds (localhost:5050 on masters and localhost:5051 on slaves).
> > This gets put into graphite via collectd's write_graphite plugin.
> > These 3 graphite targets give me percentages of utilization for nice
> > graphs:
> >
> > alias(asPercent(collectd.mesos.clustername.gauge-master_cpu_used,
> > collectd.mesos.clustername.gauge-master_cpu_total), "Total CPU Usage")
> > alias(asPercent(collectd.mesos.clustername.gauge-master_mem_used,
> > collectd.mesos.clustername.gauge-master_mem_total), "Total Memory
> > Usage")
> > alias(asPercent(collectd.mesos.clustername.gauge-master_disk_used,
> > collectd.mesos.clustername.gauge-master_disk_total), "Total Disk
> > Usage")
> >
> > With that data, you can have your monitoring tools, such as
> > nagios/icinga, poll graphite. Using the native graphite render API,
> > you can do things like:
> >
> >     * "if the cpu usage is over 80% for 24 hours, send a warning event"
> >     * "if the cpu usage is over 95% for 6 hours, send a critical event"
> >
> > This allows mostly no-impact monitoring since the monitoring tools are
> > hitting graphite.
> >
> > Anyways, back to the original questions:
> >
> > How does everyone do proper monitoring and capacity planning for large
> > mesos clusters? I expect my cluster to grow beyond what it currently
> > is by quite a bit.
> >
> > --
> > Jeff Schroeder
> >
> > Don't drink and derive, alcohol and analysis don't mix.
> > http://www.digitalprognosis.com
>
