On 2016-05-26 10:31, Thekla Loizou wrote:
>
> Hi Valanti,
>
> Changing the CacheGroups option of Slurm to 0 resolves the issue.
> However, I cannot understand why we were facing this problem while
> the option was set to 1 (CacheGroups=1).
>
> We had:
> CacheGroups=1
> GroupUpdateForce=1
> GroupUpdateTime=1800
>
> The way I understand this is that the Slurm daemon will cache the
> group entries, and every 30 minutes the information about which users
> are members of which groups should be updated.
> In our case this was not happening. When adding a user to a secondary
> group, the only way to make Slurm "see" this change was to restart
> the Slurm daemons on the compute nodes. The information was not
> updated every 30 minutes.
>
> Did we have something wrong in our configuration that I am missing?
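>
> For reference, the relevant slurm.conf lines now read as below; the
> comments are only my reading of the slurm.conf man page, so take them
> as a sketch rather than official documentation:
>
>   CacheGroups=0        # slurmd no longer caches group (/etc/group) lookups
>   GroupUpdateForce=1   # refresh group membership info even if /etc/group is unchanged
>   GroupUpdateTime=1800 # refresh interval in seconds (1800 s = 30 min)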
Note that 16.05 has some changes in this area which might fix your
issue; see https://bugs.schedmd.com/show_bug.cgi?id=1629 . Also, I
think it should be sufficient to run "scontrol reconfigure" instead of
restarting slurmd to make it reload the groups.

> Thanks,
> Thekla
>
> On 25/05/2016 05:50 μμ, Chrysovalantis Paschoulas wrote:
>>
>> Hi Thekla,
>>
>> maybe it is not a real bug of slurmd but a caching issue/race. Do you
>> have the CacheGroups option of Slurm enabled? If yes, can you try to
>> set CacheGroups=0, then restart the Slurm daemons and tell us if the
>> behavior has changed?
>>
>> Also, I would like to see the groups that are set when you get a
>> shell on the compute nodes after calling salloc. Could you please
>> give us the output of the command "cat /proc/$$/status | grep Groups"
>> after calling salloc? Try this with and without CacheGroups enabled.
>>
>> Thanks!
>> Valantis
>>
>> On 05/25/2016 04:10 PM, Chrysovalantis Paschoulas wrote:
>>>
>>> With strace you can see what the id command is doing in both cases:
>>> 1) When you call it without arguments, it internally calls getuid()
>>> and getgid() and returns the groups that are set for the current
>>> process (bash in this case). You can see the groups set in the
>>> current shell with the command "cat /proc/$$/status | grep Groups".
>>> 2) When you call the id command with an argument, it tries to
>>> resolve a username following a different path, where it actually
>>> asks nsswitch and, in the end, nslcd, so it returns all secondary
>>> groups correctly.
>>>
>>> For me it looks like a Slurm bug which must be fixed! slurmd should
>>> always set all secondary groups for all processes it spawns and runs
>>> as another user. I hope the devel team of Slurm will have a look
>>> into this and fix it in later versions.
>>>
>>> Best Regards,
>>> Valantis
>>>
>>> On 05/25/2016 03:56 PM, Thekla Loizou wrote:
>>>>
>>>> Hi Valanti,
>>>>
>>>> Thanks a lot for your quick reply :)
>>>>
>>>> When I get interactive access on a node through SLURM and type the
>>>> id command with and without arguments, the output is different.
>>>> Please see below:
>>>> [thekla@node05 ~]$ id
>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
>>>> [thekla@node05 ~]$ id thekla
>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
>>>>
>>>> We have had this problem before, and a few days ago we upgraded to
>>>> version 15.08.11 to see if a newer version resolves the problem,
>>>> but unfortunately the result is the same.
>>>>
>>>> Thanks,
>>>> Thekla
>>>>
>>>> On 25/05/2016 04:48 μμ, Chrysovalantis Paschoulas wrote:
>>>>>
>>>>> OK Thekla, now I understand better what's going on.
>>>>>
>>>>> It really seems to be a problem of Slurm. More specifically,
>>>>> slurmd on the compute nodes, which runs as root, changes to the
>>>>> user's uid before it starts the application, and during that step
>>>>> it should also set the groups (including secondary ones), but it
>>>>> looks like Slurm is not setting the secondary groups for the new
>>>>> process.
>>>>>
>>>>> One more thing about the "id" command: can you try to call it with
>>>>> and without arguments? Is the output the same or different? I am
>>>>> interested to see the results ;)
>>>>>
>>>>> Also, which version of Slurm do you have? Did you update Slurm
>>>>> recently? Have you always had this problem, or did you discover it
>>>>> recently? Check if a newer version of Slurm solves this problem;
>>>>> otherwise report it here.
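>>>>>
>>>>> A minimal way to compare the two code paths from inside the
>>>>> allocation would be something like the following (strace is only a
>>>>> suggestion, assuming it is installed on the node):
>>>>>
>>>>>   id                                  # no argument: credentials of the current process
>>>>>   id $USER                            # with argument: resolved via nsswitch/nslcd
>>>>>   cat /proc/$$/status | grep Groups   # supplementary groups actually set on the shell
>>>>>   strace -e trace=getgroups id        # shows id reading the process's own group list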
>>>>>
>>>>> Best Regards,
>>>>> Valantis
>>>>>
>>>>> On 05/25/2016 02:59 PM, Thekla Loizou wrote:
>>>>>>
>>>>>> Hi Valanti! :)
>>>>>>
>>>>>> We are using nslcd on the compute nodes.
>>>>>> We have indeed changed the default behavior/command of salloc,
>>>>>> but I don't think that this is the issue, because the same
>>>>>> happens when we submit jobs via sbatch. So I believe that this is
>>>>>> not related to the new command we are using.
>>>>>>
>>>>>> When logging in as root or as a user on the compute nodes via
>>>>>> ssh, we get all groups after running the "id" command,
>>>>>> but when logging in through a SLURM job (interactive with salloc
>>>>>> or non-interactive with sbatch) we face the problem I described.
>>>>>>
>>>>>> We have also checked the environment of the user in both cases
>>>>>> (ssh or SLURM), and the only differences are the SLURM
>>>>>> environment variables and nothing else.
>>>>>>
>>>>>> Thanks,
>>>>>> Thekla
>>>>>>
>>>>>> On 25/05/2016 02:07 μμ, Chrysovalantis Paschoulas wrote:
>>>>>>>
>>>>>>> Hi Thekla! :)
>>>>>>>
>>>>>>> For me it looks like a configuration issue of the client LDAP
>>>>>>> name service on the compute nodes. Which service are you using,
>>>>>>> nslcd or sssd? I can see that you have changed the default
>>>>>>> behavior/command of salloc so that the command gives you a
>>>>>>> prompt on the compute node directly (by default salloc will
>>>>>>> return a shell on the login node where it was called). Check and
>>>>>>> be sure that you are not doing something wrong in the new salloc
>>>>>>> command that you defined in slurm.conf (the SallocDefaultCommand
>>>>>>> option).
>>>>>>>
>>>>>>> Can you go as root onto the compute nodes and try to resolve a
>>>>>>> uid with the id command? What does it give you there: all
>>>>>>> groups, or are some secondary groups missing? If the secondary
>>>>>>> groups are missing, then it's not a problem of Slurm but of the
>>>>>>> config of the ID-resolving service. As far as I know, Slurm
>>>>>>> changes the environment after salloc (e.g. exports SLURM_* env
>>>>>>> vars) but shouldn't change the behavior of commands like id.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Chrysovalantis Paschoulas
>>>>>>>
>>>>>>> On 05/25/2016 10:32 AM, Thekla Loizou wrote:
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> We have noticed a very strange problem every time we add an
>>>>>>>> existing user to a secondary group.
>>>>>>>> We manage our users in LDAP. When we add a user to a new group
>>>>>>>> and then type the "id" and "groups" commands, we see that the
>>>>>>>> user was indeed added to the new group. The same happens when
>>>>>>>> running the command "getent group".
>>>>>>>>
>>>>>>>> For example, for a user "thekla" whose primary group was
>>>>>>>> "cstrc" and who was now also added to the group "build", we
>>>>>>>> get:
>>>>>>>> [thekla@node01 ~]$ id
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
>>>>>>>> [thekla@node01 ~]$ groups
>>>>>>>> cstrc build
>>>>>>>> [thekla@node01 ~]$ getent group | grep build
>>>>>>>> build:*:10257:thekla
>>>>>>>>
>>>>>>>> The above output is the correct one, and it is what we get when
>>>>>>>> we ssh to one of the compute nodes.
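>>>>>>>>
>>>>>>>> The same full list shows up when resolving the user from a root
>>>>>>>> shell on a node, for example (output shown as an illustration;
>>>>>>>> the fields match the ones above):
>>>>>>>> [root@node01 ~]# id thekla
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)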
>>>>>>>>
>>>>>>>> But when we submit a job on the nodes (i.e. getting access
>>>>>>>> through SLURM and not with direct ssh), we cannot see the new
>>>>>>>> group the user was added to:
>>>>>>>> [thekla@prometheus ~]$ salloc -N1
>>>>>>>> salloc: Granted job allocation 8136
>>>>>>>> [thekla@node01 ~]$ id
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
>>>>>>>> [thekla@node01 ~]$ groups
>>>>>>>> cstrc
>>>>>>>>
>>>>>>>> Meanwhile, the following still shows the correct result:
>>>>>>>> [thekla@node01 ~]$ getent group | grep build
>>>>>>>> build:*:10257:thekla
>>>>>>>>
>>>>>>>> This problem appears only when we get access through SLURM,
>>>>>>>> i.e. when we run a job.
>>>>>>>>
>>>>>>>> Has anyone faced this problem before? The only way we have
>>>>>>>> found to solve it is to restart the SLURM service on the
>>>>>>>> compute nodes every time we add a user to a new group.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thekla

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]