On 2016-05-26 10:31, Thekla Loizou wrote:
> 
> Hi Valanti,
> 
> Changing the CacheGroups option of Slurm to 0 resolves the issue.
> However, I cannot understand why we were facing this problem while
> the option was set to CacheGroups=1.
> 
> We had:
> CacheGroups=1
> GroupUpdateForce=1
> GroupUpdateTime=1800
> 
> so the way I understand this is that the Slurm daemons will cache the
> group entries and, every 30 minutes (GroupUpdateTime=1800 seconds),
> the information about which users are members of which groups should
> be refreshed.
> In our case this was not happening. When adding a user to a secondary
> group, the only way to make Slurm "see" this change was to restart
> the Slurm daemons on the compute nodes. The information was not
> updated every 30 minutes.
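> 
> For reference, the only change that made it work on our side was
> setting, in slurm.conf:
> 
> CacheGroups=0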
> 
> Is there something wrong in our configuration that I am missing?

Note that 16.05 has some changes in this area which might fix your
issue; see https://bugs.schedmd.com/show_bug.cgi?id=1629 .

Also, I think it should be sufficient to run "scontrol reconfigure"
instead of restarting slurmd to make it reload the groups.
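
That is, after changing a user's groups in LDAP, something like

scontrol reconfigure

should make all the Slurm daemons re-read their configuration, and
with it (I believe) the group information.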

> 
> Thanks,
> Thekla
> 
> On 25/05/2016 05:50 PM, Chrysovalantis Paschoulas wrote:
>>
>> Hi Thekla,
>>
>> maybe it is not a real bug of slurmd but a caching issue/race. Have
>> you enabled the CacheGroups option of Slurm? If yes, can you set
>> CacheGroups=0, restart the Slurm daemons, and tell us if the
>> behavior has changed?
>>
>> Also, I would like to see the groups that are set when you get a
>> shell on a compute node after calling salloc. Could you please give
>> us the output of the command "cat /proc/$$/status | grep Groups"
>> there? Try this both with and without CacheGroups enabled.
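>>
>> For a shell that did pick up the secondary group, I would expect
>> something like:
>>
>> $ cat /proc/$$/status | grep Groups
>> Groups: 5000 10257
>>
>> i.e. the GIDs listed numerically. If a GID is missing there, the
>> process was simply never given that group.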
>>
>> Thanks!
>> Valantis
>>
>>
>>
>> On 05/25/2016 04:10 PM, Chrysovalantis Paschoulas wrote:
>>>
>>> With strace you can see what the id command is doing in the two
>>> cases:
>>> 1) When you call it without arguments, it internally calls getuid,
>>> getgid and getgroups and returns the groups that were set for the
>>> current process (bash in this case). You can see the groups set in
>>> the current shell with the command "cat /proc/$$/status | grep
>>> Groups".
>>> 2) When you call the id command with an argument, it resolves the
>>> username via a different path that in the end queries nsswitch and
>>> nslcd, so it returns all secondary groups correctly.
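>>>
>>> For example (roughly; the exact trace differs per system):
>>>
>>> $ strace -e trace=getgroups id
>>> $ strace id thekla 2>&1 | grep nslcd
>>>
>>> The first call just reads the groups already inherited from the
>>> current shell, while the second ends up connecting to the nslcd
>>> socket to look the user up.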
>>>
>>> For me this looks like a Slurm bug which must be fixed! slurmd
>>> should always set all secondary groups for every process it spawns
>>> and runs as another user. I hope the Slurm development team will
>>> have a look into this and fix it in a later version.
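>>>
>>> As a rough illustration (assuming util-linux's setpriv is
>>> available), the difference is essentially:
>>>
>>> # setpriv --reuid thekla --regid cstrc --init-groups id
>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
>>> # setpriv --reuid thekla --regid cstrc --clear-groups id
>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
>>>
>>> The first form calls initgroups(3) before dropping privileges,
>>> which is what slurmd should be doing; the second skips the
>>> supplementary groups, which matches what we see inside jobs.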
>>>
>>> Best Regards,
>>> Valantis
>>>
>>>
>>> On 05/25/2016 03:56 PM, Thekla Loizou wrote:
>>>>
>>>> Hi Valanti,
>>>>
>>>> Thanks a lot for your quick reply :)
>>>>
>>>> When I get interactive access on a node through SLURM and type
>>>> the id command with and without arguments, the output differs.
>>>> Please see below:
>>>> [thekla@node05 ~]$ id
>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
>>>> [thekla@node05 ~]$ id thekla
>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
>>>>
>>>> We have had this problem before, and a few days ago we upgraded
>>>> to version 15.08.11 to see if a newer version resolves it, but
>>>> unfortunately the result is the same.
>>>>
>>>> Thanks,
>>>> Thekla
>>>>
>>>> On 25/05/2016 04:48 PM, Chrysovalantis Paschoulas wrote:
>>>>>
>>>>> OK Thekla, now I understand better what's going on.
>>>>>
>>>>> It really seems to be a problem in Slurm. More specifically,
>>>>> slurmd on the compute nodes, which runs as root, switches to the
>>>>> user's uid before starting the application; during that step it
>>>>> should also set the groups (secondary ones included), but it
>>>>> looks like Slurm is not setting the secondary groups for the new
>>>>> process.
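>>>>>
>>>>> You can check that it is the process credentials (and not the
>>>>> name service lookup) that are wrong by asking the kernel directly
>>>>> from inside a job, e.g.:
>>>>>
>>>>> $ srun grep Groups /proc/self/status
>>>>> Groups: 5000
>>>>>
>>>>> If the new GID is missing from that line, slurmd never set it
>>>>> when it switched to your uid.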
>>>>>
>>>>> One more thing about the "id" command: can you try to call it
>>>>> both with and without arguments? Is the output the same or
>>>>> different? I am interested to see the results ;)
>>>>>
>>>>> Also, which version of Slurm do you have? Did you update Slurm
>>>>> recently? Have you always had this problem, or did you discover
>>>>> it only recently? Check whether a newer version of Slurm solves
>>>>> the problem; otherwise report it here.
>>>>>
>>>>> Best Regards,
>>>>> Valantis
>>>>>
>>>>>
>>>>> On 05/25/2016 02:59 PM, Thekla Loizou wrote:
>>>>>>
>>>>>> Hi Valanti! :)
>>>>>>
>>>>>> We are using nslcd on the compute nodes.
>>>>>> We have indeed changed the default behavior/command of salloc,
>>>>>> but I don't think that this is the issue, because the same
>>>>>> happens when we submit jobs via sbatch. So I believe this is
>>>>>> not related to the new command we are using.
>>>>>>
>>>>>> When logging in as root or as a user on the compute nodes via
>>>>>> ssh, we get all groups after running the "id" command, but when
>>>>>> logging in through a SLURM job (interactively with salloc or
>>>>>> non-interactively with sbatch) we face the problem I described.
>>>>>>
>>>>>> We have also checked the environment of the user in both cases
>>>>>> (ssh or SLURM), and the only differences are the SLURM
>>>>>> environment variables, nothing else.
>>>>>>
>>>>>> Thanks,
>>>>>> Thekla
>>>>>>
>>>>>>
>>>>>> On 25/05/2016 02:07 PM, Chrysovalantis Paschoulas wrote:
>>>>>>>
>>>>>>> Hi Thekla! :)
>>>>>>>
>>>>>>> For me it looks like a configuration issue of the LDAP name
>>>>>>> service client on the compute nodes. Which service are you
>>>>>>> using, nslcd or sssd? I can see that you have changed the
>>>>>>> default behavior/command of salloc so that it gives you a
>>>>>>> prompt on the compute node directly (by default salloc returns
>>>>>>> a shell on the login node where it was called). Check and make
>>>>>>> sure that you are not doing something wrong in the new salloc
>>>>>>> command that you defined in slurm.conf (the
>>>>>>> SallocDefaultCommand option).
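>>>>>>>
>>>>>>> (A typical override of that kind looks something like:
>>>>>>>
>>>>>>> SallocDefaultCommand="srun -n1 -N1 --mem-per-cpu=0 --pty --preserve-env $SHELL"
>>>>>>>
>>>>>>> so double-check the exact srun options you put there.)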
>>>>>>>
>>>>>>> Can you go in as root on the compute nodes and try to resolve
>>>>>>> a uid with the id command? What does it give you there: all
>>>>>>> groups, or are some secondary groups missing? If the secondary
>>>>>>> groups are missing, then it's not a problem of Slurm but of
>>>>>>> the configuration of the ID resolving service.
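>>>>>>>
>>>>>>> For example, as root on a compute node, resolving the uid
>>>>>>> should show all the groups:
>>>>>>>
>>>>>>> # id 2017
>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
>>>>>>>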
>>>>>>> As far as I know, Slurm changes the environment after salloc
>>>>>>> (e.g. it exports SLURM_* env vars) but shouldn't change the
>>>>>>> behavior of commands like id.
>>>>>>>
>>>>>>> Best Regards,
>>>>>>> Chrysovalantis Paschoulas
>>>>>>>
>>>>>>>
>>>>>>> On 05/25/2016 10:32 AM, Thekla Loizou wrote:
>>>>>>>>
>>>>>>>> Dear all,
>>>>>>>>
>>>>>>>> We have noticed a very strange problem every time we add an
>>>>>>>> existing user to a secondary group.
>>>>>>>> We manage our users in LDAP. When we add a user to a new group
>>>>>>>> and then type the "id" and "groups" commands, we see that the
>>>>>>>> user was indeed added to the new group. The same happens when
>>>>>>>> running the command "getent group".
>>>>>>>>
>>>>>>>> For example, for the user "thekla", whose primary group is
>>>>>>>> "cstrc" and who has now also been added to the group "build",
>>>>>>>> we get:
>>>>>>>> [thekla@node01 ~]$ id
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
>>>>>>>> [thekla@node01 ~]$ groups
>>>>>>>> cstrc build
>>>>>>>> [thekla@node01 ~]$ getent group | grep build
>>>>>>>> build:*:10257:thekla
>>>>>>>>
>>>>>>>> The above output is the correct one, and it is what we get
>>>>>>>> when we ssh to one of the compute nodes.
>>>>>>>>
>>>>>>>> But when we submit a job to the nodes (i.e. get access
>>>>>>>> through SLURM and not via direct ssh), we cannot see the new
>>>>>>>> group the user was added to:
>>>>>>>> [thekla@prometheus ~]$ salloc -N1
>>>>>>>> salloc: Granted job allocation 8136
>>>>>>>> [thekla@node01 ~]$ id
>>>>>>>> uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
>>>>>>>> [thekla@node01 ~]$ groups
>>>>>>>> cstrc
>>>>>>>>
>>>>>>>> Meanwhile, the following output still shows the correct result:
>>>>>>>> [thekla@node01 ~]$ getent group | grep build
>>>>>>>> build:*:10257:thekla
>>>>>>>>
>>>>>>>> This problem appears only when we get access through SLURM,
>>>>>>>> i.e. when we run a job.
>>>>>>>>
>>>>>>>> Has anyone faced this problem before? The only way we have
>>>>>>>> found to solve it is to restart the SLURM service on the
>>>>>>>> compute nodes every time we add a user to a new group.
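>>>>>>>>
>>>>>>>> (i.e., depending on the init system, something like
>>>>>>>> "systemctl restart slurmd" or "/etc/init.d/slurm restart" on
>>>>>>>> every compute node.)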
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Thekla
>>>>>>>


-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]
