Hi,

We have a cluster running Slurm 2.6.7. For some reason, after running "newgrp" to change their primary group, some users cannot run jobs.

This is the output one gets after changing the group:

$ id
uid=1091(username) gid=101(xxx) groups=101(xxx),10(yyy),108(zzz),581(gaussian),592(aaa),741(bbb)
$ srun -n 1 -p test hostname
c1
$ newgrp gaussian
$ id
uid=1091(username) gid=581(gaussian) groups=101(xxx),10(yyy),108(zzz),581(gaussian),592(aaa),741(bbb)
$ srun -n 1 -p test hostname
srun: error: Task launch for 1406416.0 failed on node c1: Group ID not found on host
srun: error: Application launch failed: Group ID not found on host
srun: Job step aborted: Waiting up to 2 seconds for job step to finish.
srun: error: Timed out waiting for job step to complete

This is odd, because newgrp works in most cases; it fails only for a set of groups that have a lot of members (in this case 216 members, 1740 characters). For historical reasons, the groups with a lot of members have been split across multiple lines (in this case, 4 lines), as in this example:

gaussian:x:581:one,two,three,four,[...]
gaussian:x:581:twenty-three,twenty-four,twenty-five,[...]
gaussian:x:581:forty-nine,fifty,fifty-one,[...]
gaussian:x:581:one-hundred-seventy-nine,[...]

I have checked, and the files /etc/group and /etc/passwd on the compute nodes and on the login nodes have the same set of user identities.

Has anybody noticed this kind of behaviour?

Best Regards,
Marco Passerini
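
P.S. In case it helps anyone reproduce this, here is a minimal Python sketch for finding groups that are split across several lines and the users who appear only on a continuation line. It rests on an assumption that is not confirmed for Slurm: that a lookup which stops at the first matching entry would not see those users as members. The function names and the sample data are hypothetical; on a real node one would pass in the contents of /etc/group instead of SAMPLE.

```python
from collections import defaultdict


def parse_group_lines(text):
    """Map (name, gid) to a list of member lists, one per line in the file."""
    entries = defaultdict(list)
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split(":")
        if len(fields) != 4:
            continue  # skip malformed or special (e.g. NIS "+") lines
        name, _passwd, gid, members = fields
        entries[(name, gid)].append([m for m in members.split(",") if m])
    return entries


def split_groups(text):
    """Return only the groups that occur on more than one line."""
    return {k: v for k, v in parse_group_lines(text).items() if len(v) > 1}


def first_line_misses(text, user):
    """Names of split groups where `user` is listed, but not on the first line."""
    missed = []
    for (name, _gid), lists in split_groups(text).items():
        if user not in lists[0] and any(user in l for l in lists[1:]):
            missed.append(name)
    return missed


# Hypothetical sample in the same layout as the example above.
SAMPLE = """\
xxx:x:101:username
gaussian:x:581:one,two,three
gaussian:x:581:twenty-three,username
"""

if __name__ == "__main__":
    print(first_line_misses(SAMPLE, "username"))
```

If the affected users all turn up in the output for their group while the unaffected ones do not, that would point at the split entries rather than at the group size itself.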
