Update: if I remove not only CPUs from gres.conf but also Procs from the NodeName line in slurm.conf, the second issue goes away. That is a workaround, though, not a solution.
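For reference, a minimal sketch of the gres.conf half of that workaround, assuming the only change is dropping the CPUs field from the two device lines quoted in the original message below. The matching NodeName line in slurm.conf is not shown in this thread, so it is not reproduced here.

```
# /etc/slurm/gres.conf without the CPUs binding (workaround only)
Name=gpu File=/dev/nvidia0
Name=gpu File=/dev/nvidia1
```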
Any ideas?

Best regards,
Taras

On Tue, Jul 23, 2013 at 8:00 PM, Taras Shapovalov <[email protected]> wrote:

> Hi all,
>
> We have a SLURM cluster with 2 GPUs per node. There are two quite
> interesting issues. I am sending both issues in a single email because
> I guess they are linked somehow.
>
> ISSUE 1:
>
> When CR_CORE_DEFAULT_DIST_BLOCK is set and a user requests --gres=gpu:1,
> slurmctld dies with a segmentation fault. With --gres=gpu:2 it works fine.
>
> I found that the segfault happens in
> ./src/plugins/select/cons_res/dist_tasks.c:
>
>     /*
>      * If SelectTypeParameters mentions to use a block distribution for
>      * cores by default, use that kind of distribution if no particular
>      * cores distribution specified.
>      * Note : cyclic cores distribution, which is the default, is treated
>      * by the next code block
>      */
>     if ( slurmctld_conf.select_type_param & CR_CORE_DEFAULT_DIST_BLOCK ) {
>         switch(job_ptr->details->task_dist) {
>         case SLURM_DIST_ARBITRARY:
>         case SLURM_DIST_BLOCK:
>         case SLURM_DIST_CYCLIC:
>         case SLURM_DIST_UNKNOWN:
>             _block_sync_core_bitmap(job_ptr, cr_type);  <-------------------
>             return SLURM_SUCCESS;
>         }
>     }
>
> Disabling CR_CORE_DEFAULT_DIST_BLOCK fixes the segfaults. In particular,
> slurmctld dies on this line:
>
>     sufficient = sockets_cpu_cnt[s] >= req_cpus ;
>
> because s=3154116728 (according to gdb), which in turn (my guess) happens
> because ntasks_per_core=65535 in the same function, which looks like an
> integer overflow somewhere.
>
> Stack trace is attached.
>
>
> ISSUE 2:
>
> When a user requests 2 GPUs, the job is *always* rejected.
> For example:
>
> [roman@headnode ~]$ srun -N1 -c2 -n2 --gres=gpu:2 -p k20 hostname
> srun: error: Unable to allocate resources: Requested node configuration is not available
> [roman@headnode ~]$
>
> When cons_res is enabled:
>
> [root@headnode ~]# grep Select /etc/slurm/slurm.conf
> SelectType=select/cons_res
> #SelectTypeParameters=CR_Core,CR_CORE_DEFAULT_DIST_BLOCK
> SelectTypeParameters=CR_Core
>
> [root@headnode ~]# grep debug -i /etc/slurm/slurm.conf
> DebugFlags=Gres,CPU_BIND,Steps
> SlurmctldDebug=5
> SlurmdDebug=5
>
> then I see these errors in /var/log/slurmctld:
>
> [2013-07-24T01:03:36+08:00] cons_res: _can_job_run_on_node: 0 cpus on node007(0), mem 0/64000
> [2013-07-24T01:03:36+08:00] cons_res: _can_job_run_on_node: 0 cpus on node008(0), mem 0/64000
>
> When a user requests 1 GPU per node, it works fine:
>
> [2013-07-24T01:11:59+08:00] cons_res: _can_job_run_on_node: 8 cpus on node007(0), mem 0/1
> [2013-07-24T01:11:59+08:00] cons_res: _can_job_run_on_node: 8 cpus on node008(0), mem 0/1
>
> When cons_res is disabled but 2 GPUs are requested, I see:
>
> [2013-07-24T01:17:56+08:00] gres: gpu state for job 3623
> [2013-07-24T01:17:56+08:00] gres_cnt:2 node_cnt:0
> [2013-07-24T01:17:56+08:00] _pick_best_nodes: job 3623 never runnable
> [2013-07-24T01:17:56+08:00] debug: (node_scheduler.c:165) job id: 3623 -- No nodes in bitmap of job_record!
> [2013-07-24T01:17:56+08:00] debug: (node_scheduler.c:1785) job id: 3623 -- job_record->gres: (gpu:2), job_record->gres_alloc: ()
> [2013-07-24T01:17:56+08:00] debug: (node_scheduler.c:1687) job id: 3623 -- job_record->gres: (gpu:2), job_record->gres_alloc: ()
> [2013-07-24T01:17:56+08:00] _slurm_rpc_allocate_resources: Requested node configuration is not available
>
> Nodes are configured this way:
>
> NodeName=node008 Arch=x86_64 CoresPerSocket=8
>    CPUAlloc=0 CPUErr=0 CPUTot=16 CPULoad=0.00 Features=(null)
>    Gres=gpu:2
>    NodeAddr=node008 NodeHostName=node008
>    OS=Linux RealMemory=64000 Sockets=2 Boards=1
>    State=IDLE ThreadsPerCore=1 TmpDisk=0 Weight=1
>    BootTime=2013-07-23T00:31:38 SlurmdStartTime=2013-07-24T00:07:13
>    CurrentWatts=0 LowestJoules=0 ConsumedJoules=0
>
> Each /etc/slurm/gres.conf contains these lines:
>
> Name=gpu File=/dev/nvidia0 CPUs=0-7
> Name=gpu File=/dev/nvidia1 CPUs=8-15
>
> This issue may also be related to
> https://groups.google.com/forum/#!topic/slurm-devel/N5j1AjAbsbw
> but disabling CPU binding does not help.
>
> Any ideas about this puzzle are highly appreciated!
>
> Best regards,
> Taras
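A minimal C sketch of the guess in issue 1 above, under stated assumptions. 65535 is 0xFFFF, the value a 32-bit all-ones "no limit" marker takes when stored in a 16-bit field, so ntasks_per_core=65535 might be a sentinel rather than an arithmetic overflow. Whether SLURM uses it that way in this code path is an assumption. The second helper shows a bounds check in front of the comparison from the crashing line. All `example_*` names, the `EXAMPLE_INFINITE32` macro, and the array element type are hypothetical and not taken from the SLURM source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical 32-bit "no limit" marker (assumption, not copied from SLURM). */
#define EXAMPLE_INFINITE32 0xffffffffu

/* Storing a 32-bit value in a 16-bit field keeps only the low 16 bits. */
uint16_t example_store_limit16(uint32_t limit)
{
    return (uint16_t) limit;
}

/* True when the 16-bit field holds the truncated "no limit" marker. */
bool example_is_unlimited16(uint16_t value)
{
    return value == (uint16_t) EXAMPLE_INFINITE32;
}

/*
 * Bounds-checked form of "sockets_cpu_cnt[s] >= req_cpus".
 * Returns false instead of reading past the array when s is out of range.
 */
bool example_socket_sufficient(const uint16_t *sockets_cpu_cnt,
                               size_t socket_cnt, uint32_t s,
                               uint16_t req_cpus)
{
    if (sockets_cpu_cnt == NULL || s >= socket_cnt)
        return false;
    return sockets_cpu_cnt[s] >= req_cpus;
}
```

With a two-socket array and the index reported by gdb, `example_socket_sufficient(cnt, 2, 3154116728u, 2)` returns false rather than dereferencing an out-of-range element. This only illustrates the failure mode. It does not show where the bad index comes from.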
