Igor,

In our setup, the ganglia clients have no knowledge of Slurm. Instead, a 
separate web front end queries Slurm to determine which nodes are part of a job. 
It then pulls the ganglia data associated with those nodes for display to the user.
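
The node lookup itself is just a couple of Slurm calls. Roughly (a minimal 
sketch in Python rather than our actual front-end code, with a made-up jobid):

import subprocess

def job_nodes(jobid):
    """Return the individual hostnames a job is running on."""
    # squeue prints the job's (possibly compressed) hostlist, e.g. node[01-04]
    nodelist = subprocess.check_output(
        ["squeue", "-h", "-j", str(jobid), "-o", "%N"], text=True).strip()
    # scontrol expands it into one hostname per line
    return subprocess.check_output(
        ["scontrol", "show", "hostnames", nodelist], text=True).split()

print(job_nodes(123456))  # made-up jobid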

Pete

From: Igor Yakushin <igor.2...@gmail.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Sunday, September 18, 2016 at 9:57 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Re: how to monitor CPU/RAM usage on each node of a slurm job? python API?

Hi Peter,
A Ganglia plugin would be interesting. How do the ganglia clients on different 
nodes communicate? Typically they talk only to the central node, not to each 
other. But to determine that they are part of the same job, wouldn't they 
somehow need to talk to each other?
Thank you,
Igor


On Sun, Sep 18, 2016 at 10:01 PM, Peter A Ruprecht 
<peter.rupre...@colorado.edu> wrote:
Igor,

Would sstat give you what you need? (http://slurm.schedmd.com/sstat.html) It 
doesn't update instantaneously, but it does update at least a few times a minute.
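
For example (an untested sketch; the field names come straight from the sstat 
docs, and the jobid/step are made up):

import subprocess

# Poll per-step usage for a running job; ".batch" targets the batch step
# (or pass "-a" to cover all steps). Fields are from the sstat man page.
out = subprocess.check_output(
    ["sstat", "-n", "-P", "-j", "123456.batch",
     "--format=JobID,AveCPU,AveRSS,MaxRSS,MaxRSSNode"], text=True)
print(out)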

If you want to get fancy, I believe that XDMoD can integrate with TACC Stats to 
provide graphs of what is happening inside a job, but I'm not sure whether that 
updates in "real" time.

One of our summer interns created a custom ganglia interface that checked which 
nodes a job was running on and graphed several relevant variables selected from 
the ganglia RRD files for those nodes.  If you're interested in seeing that 
work, I can look into whether we can share it.
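
The core of it was something along these lines (a rough sketch rather than the 
intern's actual code; the RRD root and metric names are assumptions based on 
common ganglia defaults, so check your gmetad.conf):

import subprocess

RRD_ROOT = "/var/lib/ganglia/rrds/mycluster"  # assumption: common gmetad default

def fetch(host, metric, window=600):
    # rrdtool fetch prints timestamp/value rows for the last `window` seconds
    rrd = "%s/%s/%s.rrd" % (RRD_ROOT, host, metric)
    return subprocess.check_output(
        ["rrdtool", "fetch", rrd, "AVERAGE", "--start", "-%d" % window],
        text=True)

for host in ["node01", "node02"]:   # the nodes the job lookup returned
    print(fetch(host, "cpu_user"))  # metric names vary by site, e.g. mem_free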

So there are a few existing ways to approach this.

Pete

From: Igor Yakushin <igor.2...@gmail.com>
Reply-To: slurm-dev <slurm-dev@schedmd.com>
Date: Sunday, September 18, 2016 at 6:42 PM
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] how to monitor CPU/RAM usage on each node of a slurm job? python API?

Hi All,

For a given jobid, I'd like to be able to see how much CPU and RAM the job is 
using at this moment on each node it is running on. Is there a way to do that?

So far it looks like I have to script it: get the list of involved nodes using, 
for example, squeue or qstat, then ssh to each node and find all of the user's 
processes (though it is not 100% guaranteed that they belong to the job I am 
interested in: is there a way to find the UNIX PIDs corresponding to a Slurm 
jobid?).
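
Something like this is what I have in mind (a rough sketch, assuming 
passwordless ssh; I believe scontrol listpids maps a jobid to its PIDs on the 
local node, but I have not verified that against my Slurm version):

import subprocess

jobid = 123456                  # made-up jobid
nodes = ["node01", "node02"]    # e.g. from: squeue -h -j JOBID -o %N

for node in nodes:
    # scontrol listpids has to run on the node itself, hence the ssh hop
    pids = subprocess.check_output(
        ["ssh", node, "scontrol", "listpids", str(jobid)], text=True)
    print(node, pids)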

Another question: is there a Python API for Slurm? I found pyslurm, but so far 
it will not build against my version of Slurm.

Thank you,
Igor

