On 03/07/2015 01:02 PM, Jeff Schroeder wrote:
> I wrote a Python collectd plugin which pulls both master stats (only
> when master/elected == 1) and slave stats from the REST API, under
> /metrics/snapshot and /slave(1)/stats.json respectively, and throws
> those into Graphite.
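Something like the following is how I picture the core of such a plugin. This is only an illustrative sketch, not the actual plugin: the function names are mine, and it assumes the Graphite plaintext protocol ("path value timestamp") and the metric names mentioned above.

```python
# Hypothetical sketch of the poller described above: fetch the master's
# /metrics/snapshot JSON and turn it into Graphite plaintext lines.
import json
import urllib.request


def snapshot_to_graphite(snapshot, prefix="mesos", timestamp=0):
    """Convert a /metrics/snapshot dict into Graphite plaintext lines.

    Graphite metric paths are dot-separated, so the slashes in Mesos
    metric names (e.g. "master/elected") are rewritten to dots.
    """
    lines = []
    for name, value in sorted(snapshot.items()):
        path = "%s.%s" % (prefix, name.replace("/", "."))
        lines.append("%s %s %d" % (path, value, timestamp))
    return lines


def poll_master(host="master", port=5050):
    """Fetch the snapshot; only report master stats if this node is elected."""
    url = "http://%s:%d/metrics/snapshot" % (host, port)
    with urllib.request.urlopen(url) as resp:
        snapshot = json.load(resp)
    if snapshot.get("master/elected") != 1:
        return []  # not the leader: skip master-side stats
    return snapshot_to_graphite(snapshot)
```

The elected check matters because every master serves /metrics/snapshot, but only the leader's view of the cluster is authoritative.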

Interesting. I have not found anything suitable, but my needs are for a
diverse, heterogeneous environment of quite a few different architectures
running Linux. Some have tried Nagios and such; some have rudimentary
self-developed tools. The most likely candidate to be easily extensible
is probably gkrellm or something similar.

I think this is an area that will become a sub-project, as different
organizations will probably need vastly different tools to track
distributed resource utilization, as well as the myriad of processes and
the linkages (relationships) between those processes. YMMV. Then folks
are going to want to compare how a problem is solved in different
languages (C, Java, Scala, Python, etc.), and therefore there will be a
need for metrics to feed the analytic processes.

Furthermore, I see Mesos diversifying to be useful both for HPCC (High
Performance Cluster Computing), where a single problem spans large
numbers of processors and resources, and for clusters/clouds, where a
myriad of small to large tasks are processed concurrently. Robustly
monitoring both scenarios does require different tools, depending
greatly on the granularity of the monitoring needs.

And it goes on and on: different frameworks, file systems, and
algorithms.

> After getting everything working, I built a few dashboards, one of
> which displays these stats from http://master:5051/metrics/snapshot:
>
> master/disk_percent
> master/cpus_percent
> master/mem_percent
>
> I had assumed that this was something like aggregate cluster
> utilization, but this seems incorrect in practice. I have a small
> cluster with ~1T of memory, ~25T of disk, and ~540 CPU cores. I had a
> dozen or so small tasks running, and launched 500 tasks with 1G of
> memory and 1 CPU each.
>
> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
> considerably. I did notice that cpus_percent went to around 0.94.
>
> What is the correct way to measure overall cluster utilization for
> capacity planning? We can have the NOC watch this and simply add more
> hardware when the number starts getting low.
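As I understand them, those master metrics track *allocation* rather than measured usage, and despite the "percent" in the name they are 0-1 ratios: master/cpus_percent behaves like cpus_used / cpus_total. That matches the observation above: 500 one-CPU tasks plus a dozen small ones on ~540 cores is roughly 508/540 ≈ 0.94. A minimal sketch that recomputes the ratios from the raw counters in the same snapshot (the master/*_used and master/*_total metric names are from the snapshot; the helper itself is mine):

```python
# Back-of-the-envelope check, assuming the "percent" metrics are the
# 0-1 ratio used/total: recompute them from the raw counters.
def allocation_ratios(snapshot):
    """Return {resource: used/total} for cpus, mem, and disk."""
    ratios = {}
    for res in ("cpus", "mem", "disk"):
        total = snapshot.get("master/%s_total" % res, 0)
        used = snapshot.get("master/%s_used" % res, 0)
        ratios[res] = (used / total) if total else 0.0
    return ratios

# e.g. 500 one-CPU tasks plus a handful of small ones on 540 cores:
# allocation_ratios({"master/cpus_total": 540, "master/cpus_used": 508, ...})
# gives a cpus ratio of about 0.94.
```

For capacity planning, watching used/total per resource, and alerting when the free headroom (total minus used) shrinks below the largest task you still need to place, seems more directly useful than any single blended number.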

Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
development group has more accurate information, if not a rough roadmap,
on resource/process monitoring. Sooner or later this is going to become
a quintessential need, so I hope the "deep thinkers" are all over it in
both the user and dev groups.

In fact, the monitoring itself can easily put a significant load on the
cluster/cloud if one is not judicious in how it is architected,
implemented, and dynamically tuned.

hth,
James


