[slurm-dev] Unable to submit, possibly caused by lots of members in a unix group

Marco Passerini Mon, 28 Apr 2014 04:56:29 -0700

Hi, 

We have a cluster running Slurm 2.6.7. For some reason, after running the 
command "newgrp" to change their primary group, some users cannot run jobs.


This is the output one gets before and after changing the group: 


$ id 
uid=1091(username) gid=101(xxx) groups=101(xxx),10(yyy),108(zzz),581(gaussian),592(aaa),741(bbb) 

$ srun -n 1 -p test hostname 
c1 

$ newgrp gaussian 

$ id 
uid=1091(username) gid=581(gaussian) groups=101(xxx),10(yyy),108(zzz),581(gaussian),592(aaa),741(bbb) 

$ srun -n 1 -p test hostname 
srun: error: Task launch for 1406416.0 failed on node c1: Group ID not found on host 
srun: error: Application launch failed: Group ID not found on host 
srun: Job step aborted: Waiting up to 2 seconds for job step to finish. 
srun: error: Timed out waiting for job step to complete 


This is odd, because newgrp works in most cases. It fails only with a set of 
groups that have many members (in this case 216 members, 1740 characters). 
For historical reasons, the groups with many members have been split across 
multiple lines (in this case, 4 lines), as in this example: 

gaussian:x:581:one,two,three,four,[...] 
gaussian:x:581:twenty-three,twenty-four,twenty-five,[...] 
gaussian:x:581:forty-nine,fifty,fifty-one,[...] 
gaussian:x:581:one-hundred-seventy-nine,[...] 
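
A minimal sketch for spotting such split entries, assuming a group file in the 
usual four-field name:password:GID:members layout. The function name 
find_split_groups is hypothetical. It only reports which group names occur on 
more than one line and how long each of those lines is; it makes no claim about 
how Slurm or the C library treats them.

```python
from collections import defaultdict


def find_split_groups(group_text):
    """Return {group_name: [line lengths]} for names that occur on several lines.

    group_text is the full content of a group file as one string.
    (Hypothetical helper, not part of Slurm or any system tool.)
    """
    lengths = defaultdict(list)
    for line in group_text.splitlines():
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        fields = line.split(":")
        if len(fields) != 4:
            continue  # skip anything that is not name:password:GID:members
        lengths[fields[0]].append(len(line))
    # Keep only the names that appear on more than one line.
    return {name: lens for name, lens in lengths.items() if len(lens) > 1}
```

Passing it the contents of /etc/group from a login node and from a compute node 
would list the split groups on each, so the two results can be compared.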

I have checked, and the files /etc/group and /etc/passwd on the compute nodes 
and on the login nodes have the same set of user identities. 
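
Identical files do not by themselves show that a lookup by GID succeeds on a 
node. A minimal sketch for checking what the system group database returns for 
a given GID, assuming the failing lookup goes through the standard C library 
interface (an assumption here, not something confirmed for Slurm 2.6.7). The 
function name resolve_gid is hypothetical.

```python
import grp
import os


def resolve_gid(gid):
    """Look up gid in the system group database.

    Returns a small dict describing the entry, or None if the GID is unknown.
    (Hypothetical helper; it uses Python's grp module, not Slurm code.)
    """
    try:
        entry = grp.getgrgid(gid)
    except KeyError:
        return None  # the group database has no entry for this GID
    return {
        "name": entry.gr_name,
        "gid": entry.gr_gid,
        "member_count": len(entry.gr_mem),
    }


def resolve_current_gid():
    """Resolve the real GID of the calling process, e.g. after newgrp."""
    return resolve_gid(os.getgid())
```

Running resolve_gid(581) on both c1 and a login node would show whether the 
two hosts return the same name and member count for the gaussian group.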

Has anybody noticed this kind of behaviour? 

Best Regards, 
Marco Passerini
