We are writing a CPU Management User Guide to help SLURM users choose 
the right commands/options to control the use of CPU resources by their 
jobs.

As part of this effort, I have been looking at the commands that report 
CPU usage by jobs/steps/tasks.  Specifically, I am looking for ways of 
obtaining the following information:

1.  List of selected nodes and allocated CPUs on each node for a job/step.
2.  Distribution of tasks to nodes for a job/step.  That is, which task 
has been assigned to which node, by task id and nodename.
3.  Distribution and binding of tasks to CPUs within a node.  That is, the 
set of CPUs to which each task is bound. This only applies if task 
affinity is configured.

Item 1 is provided by "scontrol --details show job".  Item 3 can be 
obtained using the "verbose" option on "srun --cpu_bind" or 
TaskPluginParam in slurm.conf, although the format of the information 
(hardware CPU masks) is hard to interpret.  Is there any straightforward 
way of getting item 2?  I've looked at scontrol, squeue and sstat, but I 
didn't see any options that would provide this info.
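
To illustrate why the masks are awkward to read, here is a minimal sketch 
of decoding one by hand.  It assumes the mask is a single hexadecimal 
string (with or without a leading "0x") in which bit n stands for CPU n; 
the function name mask_to_cpus is hypothetical and is not part of SLURM.

```python
def mask_to_cpus(mask):
    """Return the sorted CPU ids whose bits are set in a hex mask string.

    Assumption: bit n of the mask corresponds to CPU n.
    """
    value = int(mask, 16)  # accepts "0x33" as well as "33"
    cpus = []
    bit = 0
    while value:
        if value & 1:
            cpus.append(bit)
        value >>= 1
        bit += 1
    return cpus


if __name__ == "__main__":
    # Example masks chosen for illustration only, not taken from real output.
    for m in ("0x1", "0x33", "0xF0"):
        print(m, "->", mask_to_cpus(m))
```

A readable CPU list of this kind, reported per task, is the form I would 
hope to get for item 3.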

It would be nice if "scontrol show job" could provide all three sets of 
information for a job/step.  Perhaps we will look into adding that 
functionality ourselves.  But in the meantime, is there a simple way of 
getting the task distribution information?

Thanks,
Martin Perry
Bull HN
Phoenix
