I had a bit of an adventure with that - I accidentally brought all the
slurm nodes to a "not responding" state (no job interruption). I'm mid
upgrade from 2.6.1 to 14.11.7 - so transient state, no call to fix
anything, just a question.

After my adventure, I realized I'm not sure what is really supposed to
happen on 'reload'. It looks like the controller asks all the nodes to
reload their configs?  Didn't find anything in the docs.

Here are the details...more as an fyi and entertainment:

I upgraded the dbd, controller, and one compute node - no problems - using
a pre-existing init.d/slurm script.  I'm improving the Debian package for
internal distribution and see that the init.d scripts in the distribution
do not support Debian: (1) the init functions live in
/lib/lsb/init-functions, and (2) support for force_reload() is mandatory.
I'll be happy to add both to the dev stream. I'm also newish to slurm, so
I'm trying to gain a deep understanding of the init.d script to learn
about slurm.

Investigating how to implement force_reload(), I see in /etc/init.d/slurm
that 'reload' equates to:

     killproc $prog -HUP
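
For illustration, here is a rough sketch of how the two Debian points and
the HUP mapping might fit together. It is not the distributed script. The
helper names find_init_functions and action_signal are made up for this
example, and treating force-reload the same as reload is an assumption
(Debian policy allows force-reload to reload the configuration when the
service supports that).

```shell
#!/bin/sh
# Sketch only, not the script shipped with Slurm.

# Hypothetical helper: report which init helper library is present.
# Debian uses /lib/lsb/init-functions; Red Hat style systems use
# /etc/rc.d/init.d/functions. Nothing is sourced here.
find_init_functions() {
    for f in /lib/lsb/init-functions /etc/rc.d/init.d/functions; do
        if [ -r "$f" ]; then
            echo "$f"
            return 0
        fi
    done
    return 1
}

# Hypothetical helper: print the signal an action would send, so the
# mapping can be checked without touching a running daemon.
# Assumption: force-reload simply reuses the reload behaviour (SIGHUP).
action_signal() {
    case "$1" in
        reload|force-reload) echo HUP ;;
        *)                   return 1 ;;
    esac
}

echo "force-reload -> $(action_signal force-reload)"
```

In a real script the case branch would call killproc (or the local
equivalent) with the printed signal rather than echoing it.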

I think, "fine, I can reload the configs instead of restarting the service
every time!"  So I do a:

     kill -1 <controller-pid>  #-- that's kill-dash-one

At first, the controller log shows:

[2015-06-11T07:44:12.224] Reconfigure signal (SIGHUP) received
[2015-06-11T07:44:12.235] restoring original state of nodes
[2015-06-11T07:44:12.235] restoring original partition state
[2015-06-11T07:44:12.246] cons_res: select_p_node_init
[2015-06-11T07:44:12.246] cons_res: preparing for 5 partitions
[2015-06-11T07:44:12.447] read_slurm_conf: backup_controller not specified.
[2015-06-11T07:44:12.447] cons_res: select_p_reconfigure
[2015-06-11T07:44:12.447] cons_res: select_p_node_init
[2015-06-11T07:44:12.447] cons_res: preparing for 5 partitions

Then the controller starts spitting out errors.

[2015-06-11T07:45:15.228] agent/is_node_resp: node:<hostname> rpc:1001 : Incompatible versions of client and server code

One per node. Then all the nodes stop responding.

     error: Nodes <node1,node2,...,nodeN> not responding

I restarted every node's slurmd and everything was back to normal.

I guess I expected only the controller to reload its config file, but all
the better that it asks all the nodes to do the same.

I'll of course not do this again until the upgrade is complete.
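
A guard along these lines could make that rule mechanical. This is only a
sketch: all_same_version is a hypothetical helper, not part of Slurm, and
how the version strings get collected from each daemon is left open.

```shell
#!/bin/sh
# Hypothetical helper, not part of Slurm: succeeds only when every
# version string passed in is identical (and at least one was given).
all_same_version() {
    [ "$#" -gt 0 ] || return 1
    first=$1
    for v in "$@"; do
        [ "$v" = "$first" ] || return 1
    done
    return 0
}

# Example using the two releases involved in this upgrade.
if all_same_version "14.11.7" "14.11.7" "2.6.1"; then
    echo "versions match"
else
    echo "mixed versions, skip the reload"
fi
```

The strings could come from running slurmctld -V on the controller and
slurmd -V on each node, but gathering them is site-specific.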



Thanks
Sean

-- 
Sean Blanton, Ph.D.
