Are there any errors in slurmd.log on the node or slurmctld.log on the controller?
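For example, assuming the default /var/log/slurm log locations (the configured
paths are shown by scontrol show config | grep -i logfile), you could grep both
logs for the failing job ID:

grep 179850 /var/log/slurm/slurmctld.log   # on the controller
grep 179850 /var/log/slurm/slurmd.log      # on the compute node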

Sean
________________________________
From: slurm-users <[email protected]> on behalf of Wayne Hendricks <[email protected]>
Sent: Saturday, 15 January 2022 16:04
To: [email protected] <[email protected]>
Subject: [EXT] [slurm-users] Strange sbatch error with 21.08.2&5

Running test job with srun works:
wayneh@login:~$ srun -G16 -p v100 /home/wayne.hendricks/job.sh
179851
Linux dgx1-1 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
179851
Linux dgx1-2 5.4.0-94-generic #106-Ubuntu SMP Thu Jan 6 23:58:14 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Submitting the same with sbatch does not:
wayneh@login:~$ sbatch test.sh
Submitted batch job 179850
wayneh@login:~$ cat test.out
srun: error: Unable to create step for job 179850: Unspecified error
wayneh@login:~$ cat test.sh
#!/usr/bin/env bash
#SBATCH -J testing
#SBATCH -e /home/wayne.hendricks/test.out
#SBATCH -o /home/wayne.hendricks/test.out
#SBATCH -G 16
#SBATCH --partition v100
srun uname -a

Any idea why srun and sbatch wouldn't run the same way? It seems to run correctly when I use an odd number of GPUs in sbatch (#SBATCH -G 15).
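(One way to gather more detail here, a debugging sketch rather than anything
from the original post: raise srun's verbosity inside the batch script and
inspect the job record after submission. Both commands below are standard
Slurm; -v can be repeated to increase srun's logging.

srun -vvv uname -a
scontrol show job 179850)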

Node config:
NodeName=dgx1-[1-10] CPUs=80 Sockets=2 CoresPerSocket=20 ThreadsPerCore=2 RealMemory=490000 Gres=gpu:8 State=UNKNOWN
PartitionName=v100 Nodes=dgx1-[1-10] OverSubscribe=FORCE:8 DefCpuPerGPU=10 DefMemPerGPU=61250 MaxTime=INFINITE State=UP
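For reference, the per-GPU defaults exactly fill a node: 8 GPUs x DefCpuPerGPU=10 = 80 CPUs, and 8 x DefMemPerGPU=61250 = 490000 MB = RealMemory. So a -G 16 job should map onto two whole nodes, matching the two-node srun output above.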
