Hello,

Did you change /etc/slurm.conf?
I have, several times. But I do this with configuration management tools, which restart the slurmd and slurmctld daemons afterwards.
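In effect, the post-change step the tooling performs boils down to this (illustrative, not the exact recipe):

# On every compute node:
systemctl restart slurmd
# On the headnode:
systemctl restart slurmctld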

Do you have prolog scripts running on the compute nodes that might be stuck?
No, I do not have any epilog/prolog scripts configured.

Is there any Slurm plugin that is issuing any external command (like the nhc that runs on Cray nodes) at job termination?
We have no plugins installed, no NHC or similar tools. It's quite a vanilla installation.
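For completeness, a quick way to confirm nothing of the sort is configured is to grep the running config (the grep pattern is just illustrative):

scontrol show config | grep -iE 'prolog|epilog|plugin'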

Just in case, here is the Slurm config from a random node:

https://pastebin.com/zcXYFmHB

Best regards,

--
Sander Kuusemets
University of Tartu, High Performance Computing, IT Specialist
Skype: sander.kuusemets1
+372 737 5694

On 04/12/2017 01:36 PM, Miguel Gila wrote:
Hello,

Do you have prolog scripts running on the compute nodes that might be stuck (e.g. doing I/O)? Is there any Slurm plugin that issues an external command (like the NHC that runs on Cray nodes) at job termination?

M.

--
Miguel Gila
CSCS Swiss National Supercomputing Centre
HPC Operations
Via Trevano 131 | CH-6900 Lugano | Switzerland
mg [at] cscs.ch

On 12 Apr 2017, at 11:29, Benedikt Schäfer <benedikt.schae...@emea.nec.com> wrote:

Did you change /etc/slurm.conf?
You can try:
- on the clients:
systemctl restart slurmd (be sure that the slurm service is down and only slurmd is running)
- on the master:
scontrol reconfigure
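For many clients at once, a sketch (assumes pdsh is available; the hostlist is a placeholder):

# Restart slurmd on all compute nodes:
pdsh -w node[001-135] 'systemctl restart slurmd'
# Check that only slurmd is active, not the legacy slurm service:
pdsh -w node[001-135] 'systemctl is-active slurmd slurm'
# Then, on the master:
scontrol reconfigure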

best regards
Benedikt
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ Benedikt Schaefer   benedikt.schae...@emea.nec.com ~
~ Senior System Analyst                              ~
~ NEC Deutschland GmbH, HPCE Division                ~
~ Raiffeisenstr. 14, 70771 Leinfelden-Echterdingen, Germany ~
~ Tel: +49 711 780 55 21  Mobile: +49 152 22851542  Fax: +49 711 780 55 25 ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
~ NEC Deutschland GmbH, Hansaallee 101, D-40549 Duesseldorf ~
~ Geschaeftsfuehrer: Yuichi Kojima                   ~
~ Handelsregister Düsseldorf HRB 57941; VAT ID DE129424743 ~
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

-----Original Message-----
From: Sander Kuusemets [mailto:sander.kuusem...@ut.ee]
Sent: Wednesday, 12 April 2017 10:22
To: slurm-dev <slurm-dev@schedmd.com>
Subject: [slurm-dev] Slurm leaving nodes in COMPLETING state


Hello!

I have a problem with my Slurm cluster (16.05, 135 nodes, CentOS 7
across the cluster): after jobs complete, nodes are left in the
COMPLETING state for a very long time.

Right now, for example, I cancelled a 34-node MPI job and made sure
that I killed every process on the nodes, but slurmctld refuses to put
them back into a normal state.
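The stuck state shows up like this (the hostlist and user are placeholders):

squeue -t CG                             # jobs still in COMPLETING
sinfo -N -t completing                   # nodes stuck in the comp state
pdsh -w wn[01-34] 'pgrep -l -u someuser' # verify no leftover processes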

I can fix this in one of two ways:

1)

scontrol update NodeList=<completingNode> state=down reason="CG"

scontrol update NodeList=<completingNode> state=resume reason="CG"

After this the node is in the idle* state, and I have to restart
slurmd on the node.
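Put together, the per-node sequence is roughly this (node name is a placeholder; I've written NodeName=, the form documented in the scontrol man page):

NODE=wn042
scontrol update NodeName=$NODE State=down Reason="CG"
scontrol update NodeName=$NODE State=resume
ssh $NODE systemctl restart slurmd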

OR

2)

Restarting slurmctld on the headnode, in which case some jobs that
were previously on the completing nodes get requeued. This does,
however, clean up the COMPLETING state on the nodes.
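Concretely, that restart is just, on the headnode:

systemctl restart slurmctld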
The nodes themselves have nothing interesting in their slurm logs.
Neither does the headnode; its log only recommends setting max_rpc_cnt,
which I tried setting to 16, but that does not help.
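For reference, that setting went into slurm.conf roughly like this (line illustrative, not verbatim from our config), followed by an scontrol reconfigure:

SchedulerParameters=max_rpc_cnt=16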

We currently have a huge number of srun jobs.

Does anyone have any clues on how to start debugging this? Currently
we're restarting the headnode every few hours.

Best regards,

--
Sander Kuusemets
University of Tartu, High Performance Computing, IT Specialist
Skype: sander.kuusemets1
+372 737 5694

