Minor correction: that should read support for Ubuntu, and the missing directive is 'force-reload', with a dash, not an underscore.
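For reference, here is a rough sketch of how a 'force-reload' action could sit next to 'reload' in an LSB-style init script. This is not the script shipped with slurm: the function names (do_reload, do_restart, dispatch) and the daemon name in $prog are made up for illustration, and the echo lines stand in for the real killproc and restart logic so the sketch is safe to run anywhere. It assumes the LSB meaning of force-reload, which is to reload the configuration if the service supports that and to restart it otherwise.

```shell
#!/bin/sh
# Hypothetical sketch of the action dispatch in an LSB-style init script.
# All names below are illustrative, not taken from the slurm distribution.

prog="slurmctld"   # assumed daemon name, for illustration only

do_reload() {
    # A real script would do something like: killproc $prog -HUP
    echo "HUP $prog"
}

do_restart() {
    # Stand-in for a real stop followed by a start.
    echo "restart $prog"
}

dispatch() {
    case "$1" in
        reload)
            do_reload
            ;;
        force-reload)
            # Reload if the service supports it, otherwise fall back to restart.
            do_reload || do_restart
            ;;
        restart)
            do_restart
            ;;
        *)
            echo "Usage: $0 {reload|force-reload|restart}" >&2
            return 2   # LSB exit status for invalid or excess arguments
            ;;
    esac
}

# In a real init script the last line would be: dispatch "$1"
```

Since the daemon already handles SIGHUP, 'force-reload' ends up doing the same thing as 'reload' here; the fallback to restart only matters for a service that cannot reload in place.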
On Thu, Jun 11, 2015 at 8:36 AM, Sean Blanton <[email protected]> wrote:

> I had a bit of an adventure with that - I accidentally brought all the
> slurm nodes to a "not responding" state (no job interruption). I'm mid
> upgrade from 2.6.1 to 14.11.7 - so transient state, no call to fix
> anything, just a question.
>
> After my adventure, I realized I'm not sure what is really supposed to
> happen on 'reload'. It looks like the controller asks all the nodes to
> reload their configs? Didn't find anything in the docs.
>
> Here are the details...more as an fyi and entertainment:
>
> I upgraded the dbd, controller, and one compute node - no problems - using
> a pre-existing init.d/slurm script. I'm improving the Debian package for
> internal distribution and see that the init.d scripts in the distribution
> do not support Debian. ( 1. location of init functions is
> /lib/lsb/init-functions and 2. mandatory support of force_reload() -
> I'll be happy to add to the dev stream) I'm also newish to slurm, so I'm
> trying to gain a deep understanding of the init.d script to learn about
> slurm.
>
> Investigating how to implement force_reload(), I see in /etc/init.d/slurm
> that 'reload' equates to:
>
> killproc $prog -HUP
>
> I think, "fine, I can reload the configs instead of restarting the service
> every time!" So I do a:
>
> kill -1 <controller-pid> #-- that's kill-dash-one
>
> At first, the controller log shows:
>
> [2015-06-11T07:44:12.224] Reconfigure signal (SIGHUP) received
> [2015-06-11T07:44:12.235] restoring original state of nodes
> [2015-06-11T07:44:12.235] restoring original partition state
> [2015-06-11T07:44:12.246] cons_res: select_p_node_init
> [2015-06-11T07:44:12.246] cons_res: preparing for 5 partitions
> [2015-06-11T07:44:12.447] read_slurm_conf: backup_controller not specified.
> [2015-06-11T07:44:12.447] cons_res: select_p_reconfigure
> [2015-06-11T07:44:12.447] cons_res: select_p_node_init
> [2015-06-11T07:44:12.447] cons_res: preparing for 5 partitions
>
> Then the controller starts spitting out errors.
>
> [2015-06-11T07:45:15.228] agent/is_node_resp: node:<hostname> rpc:1001 :
> Incompatible versions of client and server code
>
> One per node. Then all the nodes stop responding.
>
> error: Nodes <node1,node2,...,nodeN> not responding
>
> I restarted every node's slurmd and everything was back to normal.
>
> I guess I expected only the controller to reload its config file, but all
> the better that it asks all the nodes to do the same.
>
> I'll of course not do this again until the upgrade is complete.
>
>
>
> Thanks
> Sean
>
> --
> Sean Blanton, Ph.D.
>
>

--
Sean Blanton, Ph.D.
Quantitative Technologist
Radix Trading, LLC
Desk: 773.985.0456
Cell: 773.960.3495
