Re: [slurm-users] slurmstepd crash 18.03 when using pmi2 interface

2018-11-02 Thread Matthieu Hautreux
This may be due to this commit: https://github.com/SchedMD/slurm/commit/ee2813870fed48827aa0ec99e1b4baeaca710755 It seems that the behavior was changed from a fatal error to something different when turning cgroup devices on in cgroup.conf without the proper conf file. If you do not r
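
For context, a sketch of the configuration pair involved, assuming the default /etc/slurm paths (the exact file names and device list are site-specific):

```
# /etc/slurm/cgroup.conf
ConstrainDevices=yes
AllowedDevicesFile=/etc/slurm/cgroup_allowed_devices_file.conf

# /etc/slurm/cgroup_allowed_devices_file.conf
# devices every job is always allowed to access
/dev/null
/dev/zero
/dev/urandom
/dev/cpu/*/*
```

The commit above changes what happens when ConstrainDevices is on but the allowed-devices file is missing or unreadable.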

Re: [slurm-users] '--x11' or no '--x11' when using srun when both methods work for X11 graphical applications

2017-11-29 Thread Matthieu Hautreux
Hi Kevin, Based on my understanding and a discussion with the SLURM dev team on that subject, here is some information about the new support of X11 in slurm-17.11: - slurm's native support of X11 forwarding is based on libssh2 - slurm's native support of X11 can be disabled at configure/compila
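
For reference, the two invocations the question contrasts might look like this (a sketch; `xclock` stands in for any GUI application):

```shell
# Native X11 forwarding in slurm-17.11 (requires a build with libssh2):
srun --x11 xclock

# Without --x11, the GUI only appears if DISPLAY already points at an
# X server reachable from the compute node, e.g. via "ssh -X" to the
# login node plus a site-specific forwarding setup.
srun xclock
```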

Re: [slurm-users] Strange problem with Slurm 17.11.0: "batch job complete failure"

2017-11-30 Thread Matthieu Hautreux
Hi, You should look at this bug: https://bugs.schedmd.com/show_bug.cgi?id=4412 I thought it would be resolved in 17.11.0. Regards Matthieu On 30 Nov 2017 at 00:56, "Andy Riebs" wrote: > We've just installed 17.11.0 on our 100+ node x86_64 cluster running > CentOS 7.4 this afternoon, and per

Re: [slurm-users] slurm 17.11.2: Socket timed out on send/recv operation

2018-01-16 Thread Matthieu Hautreux
Hi, With this kind of issue, one good thing to do is to get a backtrace of slurmctld during the slowdown. You should then easily identify the subcomponent responsible for the issue. I would bet on something like LDAP requests taking too much time because of a missing sssd cache. Regards Matthieu
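
Such a backtrace can be grabbed non-interactively with gdb; a sketch, assuming gdb is installed on the controller and slurmctld runs under its default name:

```shell
# Attach to the running slurmctld, dump all thread stacks, detach.
# Run on the controller while the slowdown is actually happening.
gdb -p "$(pidof slurmctld)" -batch -ex "thread apply all bt"
```

Taking two or three snapshots a few seconds apart shows which threads stay stuck in the same place (e.g. blocked in getpwnam_r-style lookups, which would point at LDAP/sssd).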

Re: [slurm-users] Too many single-stream jobs?

2018-02-12 Thread Matthieu Hautreux
Hi, your login node may have a heavy load while starting such a large number of independent sruns. This may induce issues not seen under normal load, like partial reads/writes on sockets, triggering bugs in slurm in functions not properly protected against such events. Quickly looking at the sou

Re: [slurm-users] MCS plugin and SlurmDBD

2018-05-03 Thread Matthieu Hautreux
Hi, At the time the MCS logic was added to Slurm, the filtering of slurmdbd-related information based on the MCS label was deferred because it requires a new field (mcs_label) in the slurmdbd job/step records. The addition of this label to the main branch took time and only appears in 17.11 (se

Re: [slurm-users] Job step aborted

2018-05-17 Thread Matthieu Hautreux
On Thu, 17 May 2018 at 11:28, Mahmood Naderan wrote: > Hi, > For an interactive job via srun, I see that after opening the gui, the > session is terminated automatically which is weird. > > [mahmood@rocks7 ansys_test]$ srun --x11 -A y8 -p RUBY --ntasks=10 > --mem=8GB --pty bash > [mahmood@compute

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-17 Thread Matthieu Hautreux
Hi, Communications in Slurm are not only performed from controller to slurmd and from slurmd to controller. You need to ensure that your login nodes can reach the controller and the slurmd nodes as well as ensure that slurmd on the various nodes can contact each other. This last requirement is bec
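
A sketch of the corresponding firewall rules, assuming the default ports (SlurmctldPort=6817, SlurmdPort=6818) and a cluster subnet of 10.0.0.0/24 (adjust both to your site):

```shell
# On every node: accept Slurm traffic from the cluster subnet.
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 6817 -j ACCEPT   # slurmctld
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 6818 -j ACCEPT   # slurmd

# srun also listens on ephemeral ports on the submission host for
# job I/O; pin them with SrunPortRange in slurm.conf and open that
# range too (the 60001:63000 range here is only an example).
iptables -A INPUT -p tcp -s 10.0.0.0/24 --dport 60001:63000 -j ACCEPT
```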

Re: [slurm-users] SLURM nodes flap in "Not responding" status when iptables firewall enabled

2018-05-21 Thread Matthieu Hautreux
Thanks again, Matthieu! > > Best, > > Sean > > > On Thu, May 17, 2018 at 8:06 PM, Sean Caron wrote: > >> Awesome tip. Thanks so much, Matthieu. I hadn't considered that. I will >> give that a shot and see what happens. >> >> Best, >>