I attempted to make something somewhat similar to this, with the intent of finding users who submit highly inefficient jobs. I wanted it to be real-time, require no extra plugins, and also work for users so they can see how their jobs scale. It uses only sacct and sstat. For anyone with sudo access (you have to put them in the list), it will show all of the jobs on the cluster.


Warning, terrible bash code ahead:
http://hastebin.com/wiyiweruye.bash


[+] JobID: 48100 Cores: 64 Nodes: 5 Utilization: 91% User: user1
[+] JobID: 48519 Cores: 64 Nodes: 4 Utilization: 26% User: user1
[+] JobID: 48520 Cores: 64 Nodes: 5 Utilization: 36% User: user2
[+] JobID: 48521 Cores: 64 Nodes: 9 Utilization: 6% User: user3
[+] JobID: 48563 Cores: 1 Nodes: 1 Utilization: 99% User: user4
[+] JobID: 48571 Cores: 5 Nodes: 1 Utilization: 86% User: user5
[+] JobID: 48577 Cores: 5 Nodes: 1 Utilization: 86% User: user6
[+] JobID: 48580 Cores: 8 Nodes: 1 Utilization: 87% User: user6
[+] JobID: 48612 Cores: 5 Nodes: 1 Utilization: 87% User: user6
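
For anyone who just wants the general shape of the approach without reading the script above, here is a minimal sketch of how utilization could be estimated from sacct and sstat alone. It is not the linked script; the field choices (ElapsedRaw, AveCPU, NTasks) and the math are my own assumptions about one way to do it.

#!/bin/bash
# Rough sketch only: estimate utilization of running jobs using nothing but
# sacct and sstat. Utilization is approximated as CPU time consumed so far
# divided by (elapsed wall time * allocated cores); the real script linked
# above may compute this differently.

sacct -a -X -s R -n -P -o JobID,AllocCPUS,NNodes,User,ElapsedRaw |
while IFS='|' read -r jobid cpus nodes user elapsed; do
    [ -z "$jobid" ] && continue
    # Sum CPU-seconds across all running steps: AveCPU (per task) * NTasks
    cpusecs=$(sstat -a -n -P -j "$jobid" -o AveCPU,NTasks 2>/dev/null |
        awk -F'|' '{
            n = split($1, t, /[-:]/)            # AveCPU is [dd-][hh:]mm:ss
            if      (n == 4) s = ((t[1]*24 + t[2])*60 + t[3])*60 + t[4]
            else if (n == 3) s = (t[1]*60 + t[2])*60 + t[3]
            else             s = t[1]*60 + t[2]
            total += s * $2
        } END { printf "%d", total }')
    util=0
    [ "${elapsed:-0}" -gt 0 ] && [ "${cpus:-0}" -gt 0 ] &&
        util=$(( 100 * cpusecs / (elapsed * cpus) ))
    printf '[+] JobID: %s Cores: %s Nodes: %s Utilization: %s%% User: %s\n' \
        "$jobid" "$cpus" "$nodes" "$util" "$user"
done

Depending on how sacct's default time window behaves on your version, an explicit -S start time may be needed to pick up long-running jobs, and sstat on other users' jobs generally needs elevated privileges (hence the sudo list mentioned above).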

-------------------
Nicholas McCollum
HPC Systems Administrator
Alabama Supercomputer Authority

On Mon, 19 Sep 2016, Ryan Cox wrote:

I should probably add some example output:

Someone we need to talk to:
     Node   |     Memory (GB)     |         CPUs
   Hostname   Alloc    Max    Cur   Alloc   Used  Eff%
    m8-10-5    19.5      0      0       1   0.00     0
   *m8-10-2    19.5    2.3    2.2       1   0.99    99
    m8-10-3    19.5      0      0       1   0.00     0
    m8-10-4    19.5      0      0       1   0.00     0

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job


Much better:
     Node   |     Memory (GB)     |         CPUs
   Hostname   Alloc    Max    Cur   Alloc   Used  Eff%
    m9-48-2   112.0   21.1   19.3      16  15.97    99
    m9-48-3    98.0   18.5   16.8      14  13.98    99
    m9-16-3   112.0   20.9   19.2      16  15.97    99
    m9-44-1   112.0   21.0   19.2      16  15.97    99
    m9-43-3   119.0   22.3   20.4      17  16.97    99
    m9-44-2   112.0   21.2   19.3      16  15.98    99
    m9-14-4   112.0   21.0   19.2      16  15.97    99
    m9-46-4   119.0   22.5   20.5      17  16.97    99
   *m9-10-2    91.0   32.0   15.8      13  12.81    98
    m9-43-1   119.0   22.3   20.4      17  16.97    99
    m9-16-1   126.0   23.9   21.6      18  17.97    99
    m9-47-4   119.0   22.4   20.5      17  16.97    99
    m9-43-4   119.0   22.4   20.5      17  16.97    99
    m9-48-1    84.0   15.7   14.4      12  11.98    99
    m9-42-4   119.0   22.2   20.3      17  16.97    99
    m9-43-2   119.0   22.2   20.4      17  16.97    99

* denotes the node where the batch script executes (node 0)
CPU usage is cumulative since the start of the job

Ryan

On 09/19/2016 11:13 AM, Ryan Cox wrote:
We use this script that we cobbled together: https://github.com/BYUHPC/slurm-random/blob/master/rjobstat. It assumes that you're using cgroups. It uses ssh to connect to each node, so it's not very scalable, but it works well enough for us.
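
The rough shape of the idea is below. This is a sketch, not the actual rjobstat code; it assumes the cgroup v1 layout created by Slurm's cgroup plugins (/sys/fs/cgroup/<controller>/slurm/uid_<uid>/job_<jobid>), and the squeue/scontrol lookups are just one way to get the uid and node list.

#!/bin/bash
# Sketch only, NOT rjobstat: poll each node of a running job over ssh and
# read the job's cgroup v1 counters (current memory usage, peak memory usage,
# cumulative CPU time in nanoseconds). Older distros may mount the cgroup
# controllers somewhere other than /sys/fs/cgroup.

jobid=$1
uid=$(id -u "$(squeue -h -j "$jobid" -o %u)")
cg=slurm/uid_${uid}/job_${jobid}

for node in $(scontrol show hostnames "$(squeue -h -j "$jobid" -o %N)"); do
    # One ssh round trip per node; missing files just yield blanks
    vals=$(ssh "$node" "cat /sys/fs/cgroup/memory/$cg/memory.usage_in_bytes \
                            /sys/fs/cgroup/memory/$cg/memory.max_usage_in_bytes \
                            /sys/fs/cgroup/cpuacct/$cg/cpuacct.usage 2>/dev/null")
    set -- $vals
    awk -v h="$node" -v cur="${1:-0}" -v max="${2:-0}" -v cpu="${3:-0}" \
        'BEGIN { printf "%-10s  cur %5.1f GB  max %5.1f GB  cpu %9.0f s\n", h, cur/2^30, max/2^30, cpu/1e9 }'
done

The per-node ssh round trips are the slow part, which is the scalability caveat mentioned above.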

Ryan

On 09/18/2016 06:42 PM, Igor Yakushin wrote:
Hi All,

I'd like to be able to see for a given jobid how much resources are used by a job on each node it is running on at this moment. Is there a way to do it?

So far it looks like I have to script it: get the list of the involved nodes using, for example, squeue or qstat, then ssh to each node and find all of the user's processes (which are not 100% guaranteed to belong to the job I am interested in; is there a way to find the UNIX PIDs corresponding to a Slurm jobid?).
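
A rough sketch of that scripted approach for a single node, assuming scontrol listpids (which reports the PIDs the local slurmd is tracking for a job) answers the PID question:

#!/bin/bash
# Rough sketch: run on (or ssh to) a node that is executing part of the job.
# scontrol listpids avoids guessing by username; ps then reports %CPU and RSS.

jobid=$1
# Column 1 of "scontrol listpids" is the PID; skip the header and any
# placeholder entries that are not real PIDs.
for pid in $(scontrol listpids "$jobid" | awk 'NR > 1 && $1 > 0 { print $1 }'); do
    ps -o pid=,pcpu=,rss=,comm= -p "$pid"
done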

Another question: is there a Python API for Slurm? I found pyslurm, but so far it would not build with my version of Slurm.

Thank you,
Igor



--
Ryan Cox
Operations Director
Fulton Supercomputing Lab
Brigham Young University

