On 03/07/2015 01:02 PM, Jeff Schroeder wrote:
> I wrote a Python collectd plugin which pulls both master stats (only
> when master/elected == 1) and slave stats from the REST API, under
> /metrics/snapshot and /slave(1)/stats.json respectively, and throws
> those into Graphite.
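Something like the following is how I picture the core of such a plugin. This is only an illustrative sketch, not the actual plugin: the function names are mine, and it assumes the Graphite plaintext protocol ("path value timestamp") and the metric names mentioned above.

```python
# Hypothetical sketch of the poller described above: fetch the master's
# /metrics/snapshot JSON and turn it into Graphite plaintext lines.
import json
import urllib.request


def snapshot_to_graphite(snapshot, prefix="mesos", timestamp=0):
    """Convert a /metrics/snapshot dict into Graphite plaintext lines.

    Graphite metric paths are dot-separated, so the slashes in Mesos
    metric names (e.g. "master/elected") are rewritten to dots.
    """
    lines = []
    for name, value in sorted(snapshot.items()):
        path = "%s.%s" % (prefix, name.replace("/", "."))
        lines.append("%s %s %d" % (path, value, timestamp))
    return lines


def poll_master(host="master", port=5050):
    """Fetch the snapshot; only report master stats if this node is elected."""
    url = "http://%s:%d/metrics/snapshot" % (host, port)
    with urllib.request.urlopen(url) as resp:
        snapshot = json.load(resp)
    if snapshot.get("master/elected") != 1:
        return []  # not the leader: skip master-side stats
    return snapshot_to_graphite(snapshot)
```

The elected check matters because every master serves /metrics/snapshot, but only the leader's view of the cluster is authoritative.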

Interesting. I have not found anything suitable, but my needs are for a
diverse, heterogeneous environment of quite a few different architectures
running Linux. Some have tried Nagios and such; some have rudimentary
self-developed tools. The most likely candidate to be easily extensible
is probably gkrellm or something similar.

I think this is an area that will become a sub-project, as different
organizations will probably need vastly different tools to track
distributed resource utilization, as well as the myriad of processes and
the linkages (relationships) between those processes. YMMV. Then folks
are going to want to compare how a problem is solved in different
languages (C, Java, Scala, Python, etc.), and therefore there will be a
need for metrics to feed the analytic processes.

Furthermore, I see Mesos diversifying to be useful both for HPCC (High
Performance Cluster Computing), where a single problem spans large
numbers of processors and resources, and for clusters/clouds, where a
myriad of small to large tasks are processed concurrently. Robustly
monitoring both scenarios does require different tools, depending
greatly on the granularity of the monitoring needs.

And it goes on and on: different frameworks, file systems, and
algorithms.

> After getting everything working, I built a few dashboards, one of
> which displays these stats from http://master:5051/metrics/snapshot:
>
> master/disk_percent
> master/cpus_percent
> master/mem_percent
>
> I had assumed that this was something like aggregate cluster
> utilization, but this seems incorrect in practice. I have a small
> cluster with ~1T of memory, ~25T of disk, and ~540 CPU cores. I had a
> dozen or so small tasks running, and launched 500 tasks with 1G of
> memory and 1 CPU each.
>
> Now I'd expect to see the disk/cpu/mem percentage metrics above go up
> considerably. I did notice that cpus_percent went to around 0.94.
>
> What is the correct way to measure overall cluster utilization for
> capacity planning? We can have the NOC watch this and simply add more
> hardware when the number starts getting low.
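As I understand them, those master metrics track *allocation* rather than measured usage, and despite the "percent" in the name they are 0-1 ratios: master/cpus_percent behaves like cpus_used / cpus_total. That matches the observation above: 500 one-CPU tasks plus a dozen small ones on ~540 cores is roughly 508/540 ≈ 0.94. A minimal sketch that recomputes the ratios from the raw counters in the same snapshot (the master/*_used and master/*_total metric names are from the snapshot; the helper itself is mine):

```python
# Back-of-the-envelope check, assuming the "percent" metrics are the
# 0-1 ratio used/total: recompute them from the raw counters.
def allocation_ratios(snapshot):
    """Return {resource: used/total} for cpus, mem, and disk."""
    ratios = {}
    for res in ("cpus", "mem", "disk"):
        total = snapshot.get("master/%s_total" % res, 0)
        used = snapshot.get("master/%s_used" % res, 0)
        ratios[res] = (used / total) if total else 0.0
    return ratios

# e.g. 500 one-CPU tasks plus a handful of small ones on 540 cores:
# allocation_ratios({"master/cpus_total": 540, "master/cpus_used": 508, ...})
# gives a cpus ratio of about 0.94.
```

For capacity planning, watching used/total per resource, and alerting when the free headroom (total minus used) shrinks below the largest task you still need to place, seems more directly useful than any single blended number.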

Boy, I cannot wait to read the tidbits of wisdom here. Maybe the
development group has more accurate information, if not a rough roadmap,
on resource/process monitoring. Sooner or later this is going to become
a quintessential need, so I hope the "deep thinkers" are all over it in
both the user and dev groups.

In fact, the monitoring itself can easily put a significant load on the
cluster/cloud if one is not judicious in how it is architected,
implemented, and dynamically tuned.

hth,
James


