For some projects, we use group passwords and have users authenticate into the 
group when they need to access the project's files. The password is set with 
gpasswd and stored in /etc/gshadow. However, after a user authenticates into a 
group, they can no longer run SLURM jobs; their jobs go into status "(launch 
failed requeued held)".

In slurmd.log I see:
[2016-07-05T13:28:55.167] Launching batch job 2844 for UID 1000
[2016-07-05T13:28:55.175] uid 1000 is not a member of gid 5000
[2016-07-05T13:28:55.175] batch_stepd_step_rec_create() failed: Group ID not found on host
[2016-07-05T13:28:55.175] _step_setup: no job returned
[2016-07-05T13:28:55.175] done with job

It looks like this is a sanity check that tests whether the user is a static 
member of the group. They are not, so it fails. Is there any way to turn off 
this sanity check?
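
To illustrate what I mean by "static member", here is a minimal Python sketch 
of the kind of test I suspect is being made. This is only my guess at the 
logic, not Slurm's actual code, and is_static_member is a hypothetical name. 
It consults only the passwd and group databases, so a group joined through 
newgrp with a password would not count.

```python
import grp
import os
import pwd


def is_static_member(uid: int, gid: int) -> bool:
    """Return True if uid belongs to gid per the passwd/group databases.

    Hypothetical helper for illustration; an assumption about the check,
    not taken from the Slurm source.
    """
    try:
        user = pwd.getpwuid(uid)
        group = grp.getgrgid(gid)
    except KeyError:
        # Unknown UID or GID on this host.
        return False
    # Either the group is the user's primary group in passwd, or the user
    # is listed as a member in the group database.
    return user.pw_gid == gid or user.pw_name in group.gr_mem


if __name__ == "__main__":
    # os.getgid() reflects the current (possibly newgrp-switched) group.
    uid, gid = os.getuid(), os.getgid()
    print(f"uid {uid} static member of gid {gid}: {is_static_member(uid, gid)}")
```

If the guess is right, running this inside the newgrp shell from the example 
below should report that uid 1000 is not a static member of gid 5000, which 
would match the slurmd.log message.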

Below is an example of the commands to reach this point:

[prout@login-0 ~]$ id
uid=1000(prout) gid=1000(prout) groups=1000(prout),10(wheel)
[prout@login-0 ~]$ newgrp ProjectX
Password: abc123
[prout@login-0 ~]$ id
uid=1000(prout) gid=5000(ProjectX) groups=1000(prout),10(wheel),5000(ProjectX)
[prout@login-0 ~]$ sbatch --wrap="srun /bin/sleep 300"
Submitted batch job 2844
[prout@login-0 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              2844    normal     wrap    prout PD       0:00      1 (launch failed requeued held)

