Kai,

The Linux scheduler is not constrained by Slurm core allocations, so by default it can run Slurm tasks and their child processes on any core of the node. To confine a Slurm job to its allocated cores, set TaskPlugin=task/cgroup in slurm.conf and ConstrainCores=yes in cgroup.conf. See the slurm.conf and cgroup.conf man pages for more information.
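As a rough sketch of the two settings mentioned above (this is only the relevant fragment; every other line of your slurm.conf and cgroup.conf is omitted here and assumed to stay as it is at your site):

```conf
# slurm.conf (fragment): select the cgroup-based task plugin
TaskPlugin=task/cgroup

# cgroup.conf (fragment): confine each job step to its allocated cores
ConstrainCores=yes
```

A change of this kind normally takes effect only after the Slurm daemons are restarted. One way to check the result, assuming a running task with process ID <pid> on the compute node, is to look at the Cpus_allowed_list line in /proc/<pid>/status, which should then list only the cores allocated to that job.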
Also, as you note, Slurm uses its own CPU numbering system, so the CPU_IDs reported by Slurm will not necessarily match the CPU numbers reported by Linux commands like top.

Regards,
Martin Perry
Bull Phoenix

-----Original Message-----
From: [email protected] [mailto:[email protected]]
Sent: Wednesday, February 11, 2015 4:34 AM
To: slurm-dev
Subject: [slurm-dev] Possible core allocation issues

Hi,

We are experiencing some possible core allocation issues in our system. One of the issues is that the system seems to allocate cores to be shared by several processes as shown by the output of the top command on one of the nodes:

top - 16:59:08 up 16 days, 23:37, 1 user, load average: 11.99, 11.90, 11.58
Tasks: 333 total, 13 running, 320 sleeping, 0 stopped, 0 zombie
Cpu(s): 66.7%us, 0.1%sy, 0.0%ni, 33.1%id, 0.1%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 99052292k total, 9088412k used, 89963880k free, 177592k buffers
Swap: 40959992k total, 0k used, 40959992k free, 6160760k cached

  PID USER  PR NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  P COMMAND
 9477 user1 20  0 10.7g 166m 6724 R 100.0  0.2 46:01.09  8 parcas2
 9478 user1 20  0 10.7g 163m 6464 R 100.0  0.2 46:01.08 10 parcas2
 9481 user1 20  0 10.7g 161m 6548 R 100.0  0.2 46:01.00  3 parcas2
 9479 user1 20  0 10.7g 164m 6692 R  99.6  0.2 45:58.78  1 parcas2
 9471 user1 20  0 10.7g 168m 7100 R  50.2  0.2 23:14.44  0 parcas2
 9476 user1 20  0 10.7g 164m 7084 R  50.2  0.2 23:14.83  6 parcas2
11620 user2 20  0 2047m  89m 4452 R  50.2  0.1  3:15.78  0 parcas_Fe-Fe_cu
11621 user2 20  0 2047m  88m 3824 R  50.2  0.1  3:15.79  2 parcas_Fe-Fe_cu
11622 user2 20  0 2047m  88m 3824 R  50.2  0.1  3:15.82  4 parcas_Fe-Fe_cu
 9472 user1 20  0 10.7g 166m 7168 R  49.8  0.2 23:14.72  2 parcas2
 9473 user1 20  0 10.7g 161m 7164 R  49.8  0.2 23:14.57  4 parcas2
11623 user2 20  0 2047m  88m 3824 R  49.8  0.1  3:15.82  6 parcas_Fe-Fe_cu

On this node the cores 0, 2, 4 and 6 are shared by the processes of the two users.
There are altogether 12 tasks being run by the users on this node and since there are 12 cores per node (2 sockets, 6 cores each), there should in principle be no reason to share cores like this. Here are the details of the node in question:

NodeName=xxxx Arch=x86_64 CoresPerSocket=6
   CPUAlloc=12 CPUErr=0 CPUTot=12 CPULoad=11.92 Features=(null)
   Gres=(null)
   NodeAddr=al32 NodeHostName=al32 Version=14.03
   OS=Linux RealMemory=97000 AllocMem=72000 Sockets=2 Boards=1
   State=ALLOCATED ThreadsPerCore=1 TmpDisk=0 Weight=1
   BootTime=2014-12-29T17:21:18 SlurmdStartTime=2014-12-29T17:26:19
   CurrentWatts=0 LowestJoules=0 ConsumedJoules=0

According to the CPU load all the cores are in use, but the cores 5, 7, 9 and 11 are not listed as being in use by the top command. Also, if we look at the information available on the job that user2 has submitted, we see the following:

JobId=2257209 Name=testing2
   UserId=user2(xxx) GroupId=group2(xxx)
   Priority=93 Nice=0 Account=local QOS=normal
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:07:23 TimeLimit=6-23:59:00 TimeMin=N/A
   SubmitTime=2015-01-15T16:49:54 EligibleTime=2015-01-15T16:49:54
   StartTime=2015-01-15T16:52:34 EndTime=2015-01-22T16:51:34
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   Partition=8G_long_par AllocNode:Sid=alcyone:3537
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=al32
   BatchHost=al32
   NumNodes=1 NumCPUs=4 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=0
     Nodes=al32 CPU_IDs=2-3,6-7 Mem=8000
   MinCPUsNode=1 MinMemoryCPU=2000M MinTmpDiskNode=0
   Features=(null) Gres=(null) Reservation=(null)
   Shared=0 Contiguous=0 Licenses=(null) Network=(null)
   Command=xxx
   WorkDir=xxx
   StdErr=xxx
   StdIn=xxx
   StdOut=xxx

Here we see that the job is allocated four cores. The IDs are different than the ones listed in top, which may be just because SLURM uses its own ID numbers.
I know that the P column in the output of the top command refers to the "last used CPU", so is the apparent sharing of the processors just an artefact of how the top command reports CPU usage or is SLURM in fact - at least at times - letting the jobs share resources? The latter is something that we would like to avoid.

The other, possibly related, issue is that a user reported only being allocated 2 cores for a job even though the submit script requests 4 (thus doubling the time it takes to finish the run). The script he used was:

#!/bin/bash
#SBATCH -J Fe_noreppot_nogb
#SBATCH -N 1
#SBATCH -n 4
#SBATCH -t 6-23:59:00
#SBATCH -p 8G_long_par

And the parallel run was started by:

module load mvapich2/1.9-intel
mpirun -np 4 ./parcas

The partition he submitted his job to (and which was also used by the job by user2 above) has the following setup:

PartitionName=8G_long_par Nodes=xxx Default=NO MinNodes=1 MaxNodes=2 DefaultTime=10 MaxTime=30-0 Shared=NO Priority=5 State=UP

And for the whole system we've defined (among other things):

SelectType=select/cons_res
SelectTypeParameters=CR_Core_Memory

Basically there's nothing in the setup that in my opinion should cause the allocation of only half of the requested cores (using #SBATCH -c in the submit script instead of #SBATCH -n did not help) or the sharing of resources between several submitted jobs, but maybe I'm missing something. Any suggestions on how to proceed in solving these issues would be welcome.

Best regards,
Kai Ruusuvuori

--
Kai Ruusuvuori, PhD Student
Faculty of Science
Department of Physics
Division of Atmospheric Sciences
P.O. Box 64
Gustaf Hällströmin katu 2
00014 University of Helsinki
Finland
