Our site also recently upgraded from 15.08 to 17.02, and we have noticed the
same behavior.
When the gres type is included with job submission, the type and count appear
to be dropped from the gres value. If there is no type included, the count remains.
The correct gres type and count appear to be allocated to the job, but the gres
count requested is not being fully allocated to the job steps.
Given this code:
#include <openacc.h>
#include <stdio.h>
#include <unistd.h>

int main(int argc, char** argv) {
    char hostname[128];
    /* Report how many NVIDIA devices this process can see, tagged by host. */
    gethostname(hostname, sizeof hostname);
    printf("acc_get_num_devices(acc_device_nvidia): %d Host: %s\n",
           acc_get_num_devices(acc_device_nvidia), hostname);
    return 0;
}
Submitting with "-N 2 --gres=gpu:p100:2"
Output from srun and mpirun on v15.08
+ srun /home/user/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 2 Host: node1
acc_get_num_devices(acc_device_nvidia): 2 Host: node2
+ mpirun /home/user/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 2 Host: node1
acc_get_num_devices(acc_device_nvidia): 2 Host: node2
Output from srun and mpirun on v17.02
+ srun /home/user/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 1 Host: node1
acc_get_num_devices(acc_device_nvidia): 1 Host: node2
+ mpirun /home/bjohanso/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 2 Host: node1
acc_get_num_devices(acc_device_nvidia): 1 Host: node2
If I specify the gres value again as an environment variable, I get the results
I expect.
+ export SLURM_GRES=gpu:p100:2
+ SLURM_GRES=gpu:2
+ srun /home/user/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 2 Host: node1
acc_get_num_devices(acc_device_nvidia): 2 Host: node2
+ export SLURM_GRES=gpu:p100:2
+ SLURM_GRES=gpu:2
+ mpirun /home/user/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 2 Host: node1
acc_get_num_devices(acc_device_nvidia): 2 Host: node2
Submitting with "-N 2 --gres=gpu:2"
+ srun /home/user/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 2 Host: node1
acc_get_num_devices(acc_device_nvidia): 2 Host: node2
+ mpirun /home/user/test_acc_get_num_devices
acc_get_num_devices(acc_device_nvidia): 2 Host: node1
acc_get_num_devices(acc_device_nvidia): 2 Host: node2
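For sites without an OpenACC compiler handy, a rough shell equivalent of the check above might look like the sketch below. It assumes Slurm's GPU gres plugin exports CUDA_VISIBLE_DEVICES as a comma-separated list of device indices inside each job step. The function names (gres_count, visible_count, report_gres) are made up for illustration and are not part of Slurm.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Slurm.

# Print the trailing count of a single gres spec:
#   "gpu:p100:2" -> 2, "gpu:2" -> 2, "gpu" -> 1
# A spec with no numeric count is treated as 1, the default documented
# for --gres.
gres_count() {
    last=${1##*:}
    case $last in
        ''|*[!0-9]*) echo 1 ;;
        *) echo "$last" ;;
    esac
}

# Count the entries in a comma-separated device list such as "0,1".
# An empty string counts as 0. Assumption: the list holds only device indices.
visible_count() {
    printf '%s\n' "$1" | tr ',' '\n' | grep -c . || true
}

# Print one line per task comparing the requested count with the
# devices visible in this step's environment.
report_gres() {
    requested=$(gres_count "$1")
    visible=$(visible_count "${CUDA_VISIBLE_DEVICES:-}")
    echo "$(hostname): requested=$requested visible=$visible"
}
```

If this were saved as, say, gres_check.sh (a hypothetical name), running "srun sh -c '. ./gres_check.sh; report_gres gpu:p100:2'" inside the allocation should print one line per task, which can be compared against the acc_get_num_devices output above.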
On 07/05/2017 08:15 AM, Shawn Bobbin wrote:
Hi,
One of our users reported a change in behavior in 17.02-2 with the gres output
in the `squeue` command. Specifically it doesn’t show details about the gres
resources being used. For example:
On 15.08:
-bash-4.2$ sbatch --gres=gpu:1 test.sh
Submitted batch job 1124
-bash-4.2$ sbatch --gres=gpu:m40:1 test.sh
Submitted batch job 1125
-bash-4.2$ squeue -o "%i %b"
JOBID GRES
1124 gpu:1
1125 gpu:m40:1
And on 17.02:
-bash-4.2$ sbatch --partition=scavenger --qos=scavenger --gres=gpu:1 test.sh
Submitted batch job 20094
-bash-4.2$ sbatch --partition=scavenger --qos=scavenger --gres=gpu:p6000:1 test.sh
Submitted batch job 20095
-bash-4.2$ squeue -o "%i %b" | grep -e '20094' -e '20095'
20095 gpu
20094 gpu:1
In 15.08, squeue lists which GPUs are being consumed, and in 17.02 it does not.
We wanted to check in and see if this was an expected change in behavior or a
bug. I briefly searched through the changelog, and didn’t see any mention.
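To see how many jobs are affected, a small filter over the same `squeue -o "%i %b"` output could list the job IDs whose GRES field has lost its count. This is only a sketch: the function name is made up, it assumes a single gres per job, and it assumes jobs without any gres show up as an empty field or "(null)".

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm.
# Reads "JOBID GRES" lines (as printed by: squeue -o "%i %b") on stdin and
# prints the job IDs whose GRES field does not end in a numeric count.
# Note: a job submitted as plain "--gres=gpu" would also be listed.
jobs_missing_gres_count() {
    awk '$1 != "JOBID" && $2 != "" && $2 != "(null)" {
        n = split($2, parts, ":")          # e.g. "gpu:m40:1" -> gpu, m40, 1
        if (parts[n] !~ /^[0-9]+$/) print $1
    }'
}
```

It would be used as: squeue -o "%i %b" | jobs_missing_gres_count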
Thanks!
—Shawn