We are writing a CPU Management User Guide to help SLURM users choose the right commands/options to control the use of CPU resources by their jobs.
As part of this effort, I have been looking at the commands that report CPU usage by jobs/steps/tasks. Specifically, I am looking for ways of obtaining the following information:

1. List of selected nodes and allocated CPUs on each node for a job/step.
2. Distribution of tasks to nodes for a job/step. That is, which task has been assigned to which node, by task id and nodename.
3. Distribution and binding of tasks to CPUs within a node. That is, the set of CPUs to which each task is bound. This only applies if task affinity is configured.

(1) is provided by "scontrol --details show job". (3) can be obtained using the "verbose" option on "srun --cpu_bind" or TaskPluginParam in slurm.conf, although the format of the information (hardware CPU masks) is hard to interpret.

Is there any straightforward way of getting (2)? I've looked at scontrol, squeue and sstat, but I didn't see any options that would provide this info.

It would be nice if "scontrol show job" could provide all three sets of information for a job/step. Perhaps we will look into adding that functionality ourselves. But in the meantime, is there a simple way of getting the task distribution information?

Thanks,
Martin Perry
Bull HN Phoenix
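
P.S. On the hard-to-interpret CPU masks mentioned under (3): below is a minimal sketch of turning one mask into a list of CPU ids. It assumes the mask is a hexadecimal bitmask in which bit i stands for CPU i, which I have not checked against every task plugin configuration. The function name mask_to_cpus is my own and not part of SLURM.

```python
def mask_to_cpus(mask):
    """Return the CPU ids whose bits are set in a hexadecimal affinity mask.

    Assumption: bit i of the mask corresponds to CPU i on the node.
    """
    value = int(mask, 16)  # accepts "0xF0" as well as "F0"
    cpus = []
    bit = 0
    while value:
        if value & 1:
            cpus.append(bit)
        value >>= 1
        bit += 1
    return cpus


if __name__ == "__main__":
    # Hypothetical mask, used only to illustrate the decoding.
    print(mask_to_cpus("0x0F"))
```

Running each reported mask through a helper like this would give the CPU set per task in a form that is easier to put in a user guide than the raw masks.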
