Dear Janne,
Running "scontrol reconfigure" does not resolve our problem.
It seems that the only way to solve this is to set CacheGroups=0.
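For reference, the relevant part of our slurm.conf now looks like this (excerpt; the other Group* values are unchanged from my earlier mail), and we restarted the Slurm daemons on the compute nodes to apply it:

CacheGroups=0
GroupUpdateForce=1
GroupUpdateTime=1800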
Thanks,
Thekla
On 26/05/2016 10:51 AM, Janne Blomqvist wrote:
On 2016-05-26 10:31, Thekla Loizou wrote:
Hi Valanti,
Changing the CacheGroups option of Slurm to 0 resolves the issue.
However, I cannot understand why we were facing this problem with
the option set to CacheGroups=1.
We had:
CacheGroups=1
GroupUpdateForce=1
GroupUpdateTime=1800
so the way I understand this is that the Slurm daemons will cache
the group entries, and every 30 minutes (1800 seconds) the
information about which users are members of which groups should be
updated.
In our case this was not happening. When adding a user to a secondary
group, the only way to make Slurm "see" this change was to restart the
Slurm daemons on the compute nodes. The information was not updated
every 30 minutes.
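In case it helps anyone reproduce this: the values the running daemons picked up can be checked with scontrol, along these lines (output illustrative):

[root@node01 ~]# scontrol show config | grep -i Group
CacheGroups             = 1
GroupUpdateForce        = 1
GroupUpdateTime         = 1800 sec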
Did we have something wrong in our configuration which I am missing?
Note that 16.05 has some changes in this area which might fix your
issue, see https://bugs.schedmd.com/show_bug.cgi?id=1629 .
Also, I think it should be sufficient to run "scontrol reconfigure"
instead of restarting slurmd to make it reload the groups.
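That is, something like:

[root@prometheus ~]# scontrol reconfigure

which makes slurmctld and the slurmd's re-read slurm.conf without a restart.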
Thanks,
Thekla
On 25/05/2016 05:50 PM, Chrysovalantis Paschoulas wrote:
Hi Thekla,
maybe it is not a real bug in slurmd but a caching issue/race. Have
you enabled the CacheGroups option of Slurm? If so, can you try to
set CacheGroups=0, restart the Slurm daemons, and tell us if the
behavior has changed?
Also, I would like to see the groups that are set when you get a shell
on the compute nodes after calling salloc. Could you please give us
the output of the command "cat /proc/$$/status | grep Groups" after
calling salloc? Try this with and without CacheGroups enabled.
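For illustration, the Groups line lists the supplementary GIDs of the shell process, so with the GIDs from your mails I would expect something like:

[thekla@node05 ~]$ cat /proc/$$/status | grep Groups
Groups: 5000 10257

in the working case, and only "Groups: 5000" in the broken one.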
Thanks!
Valantis
On 05/25/2016 04:10 PM, Chrysovalantis Paschoulas wrote:
With strace you can see what the id command is doing in both cases:
1) When you call it without arguments, it internally calls getuid,
getgid, and getgroups, and reports the credentials that were set for
the current process (bash in this case). You can see the groups that
were set in the current shell with the command
"cat /proc/$$/status | grep Groups".
2) When you call the id command with a username argument, it resolves
the user via a different path, asking nsswitch and ultimately nslcd,
so it returns all secondary groups correctly.
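If someone wants to reproduce this, something along these lines shows the difference (the exact syscalls may vary with the coreutils version):

[thekla@node05 ~]$ strace -e trace=getgroups id        # no argument: reads the credentials of the current process
[thekla@node05 ~]$ strace -e trace=connect id thekla   # username argument: goes through NSS and ends up connecting to the nslcd socket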
For me it looks like a Slurm bug which must be fixed! slurmd should
always set all secondary groups for every process it spawns and runs
as another user. I hope the Slurm development team will have a look
into this and fix it in later versions.
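As a point of comparison, tools that switch users correctly call initgroups() for the target user before dropping privileges, which is why e.g. su on the same node should show the full group list:

[root@node05 ~]# su - thekla -c id
uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)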
Best Regards,
Valantis
On 05/25/2016 03:56 PM, Thekla Loizou wrote:
Hi Valanti,
Thanks a lot for your quick reply :)
When I get interactive access on a node through SLURM and type the
id command with and without arguments, the output is different.
Please see below:
[thekla@node05 ~]$ id
uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
[thekla@node05 ~]$ id thekla
uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
We have had this problem before, and a few days ago we upgraded to
version 15.08.11 to see if a newer version resolves it, but
unfortunately the result is the same.
Thanks,
Thekla
On 25/05/2016 04:48 PM, Chrysovalantis Paschoulas wrote:
OK Thekla now I understand better what's going on.
It really seems to be a problem in Slurm. More specifically, slurmd
on the compute nodes, which runs as root, switches to the user's uid
before it starts the application. During that step it should set the
groups (secondary ones as well), but it looks like Slurm is not
setting the secondary groups for the new process.
One more thing about the "id" command: can you try calling it with
and without arguments? Is the output the same or different? I am
interested to see the results ;)
Also, which version of Slurm do you have? Did you update Slurm
recently? Have you always had this problem, or did you discover it
recently? Check whether a newer version of Slurm solves this
problem; otherwise, report it here.
Best Regards,
Valantis
On 05/25/2016 02:59 PM, Thekla Loizou wrote:
Hi Valanti! :)
We are using nslcd on the compute nodes.
We have indeed changed the default behavior/command of salloc, but I
don't think that this is the issue, because the same happens when we
submit jobs via sbatch. So I believe this is not related to the new
command we are using.
When logging in as root or as a user on the compute nodes via ssh,
we get all groups after running the "id" command, but when logging
in through a SLURM job (interactively with salloc or
non-interactively with sbatch) we face the problem I described.
We have also checked the environment of the user in both cases
(ssh or SLURM), and the only differences are the SLURM environment
variables, nothing else.
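Concretely, the comparison was along these lines (file names just as an example):

[thekla@node01 ~]$ env | sort > /tmp/env.ssh     # in the ssh shell
[thekla@node01 ~]$ env | sort > /tmp/env.slurm   # in the salloc shell
[thekla@node01 ~]$ diff /tmp/env.ssh /tmp/env.slurm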
Thanks,
Thekla
On 25/05/2016 02:07 PM, Chrysovalantis Paschoulas wrote:
Hi Thekla! :)
For me it looks like a configuration issue of the client LDAP name
service on the compute nodes. Which service are you using, nslcd or
sssd? I can see that you have changed the default behavior/command
of salloc so that it gives you a prompt on the compute node directly
(by default salloc returns a shell on the login node where it was
called). Check and make sure that you are not doing something wrong
in the new salloc command that you defined in slurm.conf (the
SallocDefaultCommand option).
Can you go as root on the compute nodes and try to resolve a uid
with the id command? What does it give you there: all groups, or are
some secondary groups missing? If the secondary groups are missing,
then it's not a problem in Slurm but in the configuration of the
ID-resolving service.
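For example, directly on a compute node and outside of any Slurm job:

[root@node01 ~]# id thekla

If the secondary group is missing from that output as well, look at nslcd/nsswitch rather than Slurm.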
As far as I know, Slurm changes the environment after salloc (e.g.
exports SLURM_ env vars) but shouldn't change the behavior of
commands like id.
Best Regards,
Chrysovalantis Paschoulas
On 05/25/2016 10:32 AM, Thekla Loizou wrote:
Dear all,
We have noticed a very strange problem every time we add an existing
user to a secondary group.
We manage our users in LDAP. When we add a user to a new group and
then type the "id" and "groups" commands, we see that the user was
indeed added to the new group. The same happens when running the
command "getent group".
For example, for a user "thekla" whose primary group was "cstrc" and
who has now also been added to the group "build", we get:
[thekla@node01 ~]$ id
uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc),10257(build)
[thekla@node01 ~]$ groups
cstrc build
[thekla@node01 ~]$ getent group | grep build
build:*:10257:thekla
The above output is the correct one, and it is what we get when we
ssh to one of the compute nodes.
But when we submit a job on the nodes (so getting access through
SLURM and not with direct ssh), we cannot see the new group the user
was added to:
[thekla@prometheus ~]$ salloc -N1
salloc: Granted job allocation 8136
[thekla@node01 ~]$ id
uid=2017(thekla) gid=5000(cstrc) groups=5000(cstrc)
[thekla@node01 ~]$ groups
cstrc
At the same time, the following output still shows the correct result:
[thekla@node01 ~]$ getent group | grep build
build:*:10257:thekla
This problem appears only when we get access through SLURM, i.e.
when we run a job.
Has anyone faced this problem before? The only way we have found to
solve this is to restart the SLURM service on the compute nodes
every time we add a user to a new group.
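Concretely, that means something like the following after every group change (node list illustrative; depending on the setup the service may be called "slurm" instead):

[root@prometheus ~]# pdsh -w node[01-16] 'systemctl restart slurmd'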
Thanks,
Thekla