On 9/18/16, 8:41 PM, "Igor Yakushin" <igor.2...@gmail.com> wrote:
> Hi All,
> I'd like to be able to see for a given jobid how much resources are used by a
> job on each node it is running on at this moment. Is there a way to do it?
> So far it looks like I have to script it: get the list of the involved nodes
> using, for example, squeue or qstat, ssh to each node and find all the user
> processes (not
> 100% guaranteed that they would be from the job I am interested
> in: is there a way to find UNIX pids corresponding to Slurm jobid?).
You can do `scontrol listpids` on a node. It will return a mapping of PIDs to
JobIDs. From a script, though, you would have to fork a subshell to execute
scontrol and then parse its output.
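For example, a rough sketch along these lines (the exact column layout of
`scontrol listpids` output may differ between Slurm versions, so check it on
your system before trusting the parsing):

```python
import subprocess
from collections import defaultdict

def pids_by_job():
    """Map JobIDs to the PIDs that `scontrol listpids` reports on this node."""
    out = subprocess.run(["scontrol", "listpids"],
                         capture_output=True, text=True, check=True).stdout
    jobs = defaultdict(list)
    for line in out.strip().splitlines()[1:]:   # skip the header row
        fields = line.split()
        if len(fields) >= 2 and fields[0].isdigit():
            jobs[fields[1]].append(int(fields[0]))   # PID, JOBID columns
    return dict(jobs)

if __name__ == "__main__":
    for jobid, pids in sorted(pids_by_job().items()):
        print(jobid, pids)
```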
If you are using the cgroup task plugin, a better way would be to read the
cgroup hierarchy directly (/cgroup or /sys/fs/cgroup, depending on your
OS) on each compute node. There is a Python API to libcgroup
(https://git.fedorahosted.org/git/python-libcgroup.git) but I don’t think it is
complete and I’m not sure of its status (whether it is maintained or not). If
you are doing this from Python, however, I find it easier and faster to just
glob the cgroup hierarchy and read cgroup.procs and memory.stat under the slurm
tasks. You still need to get the CPU state for each process or thread under a
given job in order to get the “cpu load” for that job.
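A rough sketch of that approach (the slurm/uid_*/job_* layout below is what
cgroup v1 with the task/cgroup plugin gives me; the paths and file names will
differ with cgroup v2 or other Slurm setups, so treat them as assumptions):

```python
import glob
import os

# Assumed cgroup v1 mount point for the memory controller; adjust for your OS.
CGROUP_MEM = "/sys/fs/cgroup/memory/slurm"

def job_memory_stats():
    """Collect PIDs and memory.stat counters for every Slurm job on this node."""
    stats = {}
    for jobdir in glob.glob(os.path.join(CGROUP_MEM, "uid_*", "job_*")):
        jobid = os.path.basename(jobdir).split("_", 1)[1]
        pids = set()
        for procs in glob.glob(os.path.join(jobdir, "**", "cgroup.procs"),
                               recursive=True):
            with open(procs) as f:
                pids.update(int(p) for p in f.read().split())
        memstat = {}
        try:
            with open(os.path.join(jobdir, "memory.stat")) as f:
                for line in f:
                    key, value = line.split()
                    memstat[key] = int(value)
        except OSError:
            pass  # the job may exit between the glob and the read
        stats[jobid] = {"pids": sorted(pids), "memory": memstat}
    return stats

def cpu_ticks(pid):
    """utime + stime (in clock ticks) from /proc/<pid>/stat; sample twice and
    divide the delta by the interval to get a per-process CPU load."""
    with open(f"/proc/{pid}/stat") as f:
        # Split after the last ')' so a command name containing spaces
        # cannot shift the field positions.
        fields = f.read().rsplit(")", 1)[1].split()
    return int(fields[11]) + int(fields[12])   # utime, stime
```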
My take on this was to write a small daemon that runs on each node. It gathers
metrics for all running slurm processes on a node and aggregates them by job.
The daemon then sends the info periodically (every 30 seconds) to a Redis
database in JSON format. From there, I can write command-line utilities or web
tools that query Redis instead of slurmctld. This makes for a stateless
monitoring environment. Given that Redis runs in-memory, if Redis goes down,
all metrics are lost. However, as long as the daemon is running on each
compute node, Redis will be fully repopulated in 30 seconds.
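The core loop of such a daemon is not much more than this (a sketch only,
assuming the redis-py client and the job_memory_stats() helper from the
previous snippet; the host name, key naming, and TTL are illustrative choices,
not my actual implementation):

```python
import json
import socket
import time

import redis  # the redis-py client

# Hypothetical module name; job_memory_stats() is the cgroup helper sketched above.
from node_metrics import job_memory_stats

def run(interval=30):
    # "redis.example.com" is a placeholder for wherever your Redis lives.
    r = redis.Redis(host="redis.example.com", port=6379)
    node = socket.gethostname()
    while True:
        for jobid, data in job_memory_stats().items():
            # Expire each key after two intervals so stale jobs disappear on
            # their own, and a restarted Redis repopulates within one cycle.
            r.setex(f"job:{jobid}:{node}", 2 * interval, json.dumps(data))
        time.sleep(interval)

if __name__ == "__main__":
    run()
```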
I have some code that does all this already, but I don’t think it is ready for
mass consumption. I could put it on GitHub if anyone is interested.
> Another question: is there python API to slurm? I found pyslurm but so far it
> would not build with my version of Slurm.
What version of Slurm are you running? If you are having problems building
PySlurm, feel free to post questions here. We’d be happy to help you get
PySlurm going.