Minor correction: that's support for Ubuntu, and the missing directive is
'force-reload', with a dash, not an underscore.
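
For reference, a minimal sketch of how a 'force-reload' action could sit next
to the existing 'reload' in an LSB-style init script. This is an illustration
under assumptions, not the script shipped with slurm: the function name
slurm_action and the example daemon name slurmctld are hypothetical, and the
commands are echoed rather than executed so the sketch is safe to run. LSB
defines force-reload as "reload the configuration if the service supports it,
otherwise restart", so here it maps to the same SIGHUP as reload.

```shell
#!/bin/sh
# Hypothetical sketch only; not the init.d/slurm script from the distribution.
# On Debian/Ubuntu a real script would first source the LSB helpers:
#   . /lib/lsb/init-functions

# Print the command an init action would run (echoed, never executed).
# $1 = daemon name (assumed), $2 = init action
slurm_action() {
    prog="$1"
    action="$2"
    case "$action" in
        reload)
            # Same command the existing script uses for 'reload'
            echo "killproc $prog -HUP"
            ;;
        force-reload)
            # LSB semantics: reload if supported, otherwise restart.
            # The daemon handles SIGHUP, so reuse the reload path.
            echo "killproc $prog -HUP"
            ;;
        *)
            echo "Usage: {reload|force-reload}" >&2
            return 1
            ;;
    esac
}
```

Calling `slurm_action slurmctld force-reload` prints the killproc line instead
of sending the signal, which seems the safer way to try this out mid-upgrade.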

On Thu, Jun 11, 2015 at 8:36 AM, Sean Blanton <[email protected]> wrote:

> I had a bit of an adventure with that - I accidentally brought all the
> slurm nodes to a "not responding" state (no job interruption). I'm
> mid-upgrade from 2.6.1 to 14.11.7, so this is a transient state; no call
> to fix anything, just a question.
>
> After my adventure, I realized I'm not sure what is really supposed to
> happen on 'reload'. It looks like the controller asks all the nodes to
> reload their configs?  Didn't find anything in the docs.
>
> Here are the details...more as an fyi and entertainment:
>
> I upgraded the dbd, controller, and one compute node - no problems - using
> a pre-existing init.d/slurm script.  I'm improving the Debian package for
> internal distribution and see that the init.d scripts in the distribution
> do not support Debian (1. the location of the init functions is
> /lib/lsb/init-functions, and 2. support for force_reload() is mandatory;
> I'll be happy to add to the dev stream). I'm also newish to slurm, so I'm
> trying to gain a deep understanding of the init.d script to learn about
> slurm.
>
> Investigating how to implement force_reload(), I see in /etc/init.d/slurm
> that 'reload' equates to:
>
>      killproc $prog -HUP
>
> I think, "fine, I can reload the configs instead of restarting the service
> every time!"  So I do a:
>
>      kill -1 <controller-pid>  #-- that's kill-dash-one
>
> At first, the controller log shows:
>
> [2015-06-11T07:44:12.224] Reconfigure signal (SIGHUP) received
> [2015-06-11T07:44:12.235] restoring original state of nodes
> [2015-06-11T07:44:12.235] restoring original partition state
> [2015-06-11T07:44:12.246] cons_res: select_p_node_init
> [2015-06-11T07:44:12.246] cons_res: preparing for 5 partitions
> [2015-06-11T07:44:12.447] read_slurm_conf: backup_controller not specified.
> [2015-06-11T07:44:12.447] cons_res: select_p_reconfigure
> [2015-06-11T07:44:12.447] cons_res: select_p_node_init
> [2015-06-11T07:44:12.447] cons_res: preparing for 5 partitions
>
> Then the controller starts spitting out errors.
>
> [2015-06-11T07:45:15.228] agent/is_node_resp: node:<hostname> rpc:1001 :
> Incompatible versions of client and server code
>
> One per node. Then all the nodes stop responding.
>
>      error: Nodes <node1,node2,...,nodeN> not responding
>
> I restarted every node's slurmd and everything was back to normal.
>
> I guess I expected only the controller to reload its config file, but all
> the better that it asks all the nodes to do the same.
>
> I'll of course not do this again until the upgrade is complete.
>
>
>
> Thanks
> Sean
>
> --
> Sean Blanton, Ph.D.
>
>


-- 
Sean Blanton, Ph.D.
Quantitative Technologist
Radix Trading, LLC
Desk: 773.985.0456
Cell:   773.960.3495
