Oh ok, thanks for pointing this out.
I thought the ‘scontrol update’ command was for letting slurmctld pick up any
change in slurm.conf.
But after reading the manual again, it seems this command instead changes
settings at runtime, rather than reading changes from slurm.conf.

So is restarting slurmctld the only way to let it pick up changes in slurm.conf?
And if I change (2.2) in my plan to
(2.2) restart slurmctld to pick up changes in slurm.conf, then use ‘scontrol
reconfigure’ to push the change to all nodes
do you see any impact on the running jobs in the cluster?
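For concreteness, the revised step (2.2) might look like this on the control node (just a sketch; the service unit name ‘slurmctld’ is an assumption here, since the unit name varies by distribution and some installs use ‘slurm’ instead):

```
# Restart slurmctld so it re-reads slurm.conf on startup
sudo systemctl restart slurmctld
# Push the updated configuration out to the slurmd daemons on all nodes
scontrol reconfigure
```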

Thanks

From: Lachlan Musicman [mailto:data...@gmail.com]
Sent: Monday, October 24, 2016 2:58 PM
To: slurm-dev
Subject: [slurm-dev] Re: Impact to jobs when reconfiguring partitions?

On 25 October 2016 at 08:42, Tuo Chen Peng <tp...@nvidia.com> wrote:
Hello all,
This is my first post in the mailing list - nice to join the community!

Welcome!


I have a general question regarding slurm partition change:
If I move one node from one partition to another, will it affect the
jobs that are still running on other nodes, in both partitions?

No, it shouldn't, depending on how you execute the plan...

But we would like to do this without interrupting existing, running jobs.
What would be the safe way to do this?

And here’s my plan:
(1) drain the node to be moved in the main partition, and only that node -
keep the other nodes available for job submission.
(2) move the node from the main partition to the short-job partition
(2.1) update slurm.conf on both the control node and the node to be moved, so
that this node is listed under the short-job partition
(2.2) run scontrol update on both the control node and the node just moved, to
let slurm pick up the configuration change.
(3) the node should now be in the short-job partition; set it back to the
normal / idle state.
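As an illustration of step (2.1), the slurm.conf edit might look like the following sketch (the node and partition names are made-up examples, not taken from the thread):

```
# Before: node01 belongs to the main partition
PartitionName=main  Nodes=node[01-08] Default=YES MaxTime=INFINITE State=UP
PartitionName=short Nodes=node[09-10] MaxTime=01:00:00 State=UP

# After: node01 is listed under the short-job partition instead
PartitionName=main  Nodes=node[02-08] Default=YES MaxTime=INFINITE State=UP
PartitionName=short Nodes=node[01,09-10] MaxTime=01:00:00 State=UP
```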

Is “scontrol update” the right command to use in this case?
Does anyone see any impact / concern with the above sequence?
I’m mostly worried about whether such a partition change could cause
users’ existing jobs to be killed or fail for some reason.

Looks correct except for 2.2 - my understanding is that you would need to
restart the slurmctld process (`systemctl restart slurm`) at this point - which
is when the slurm "head" node picks up the changes to slurm.conf - and
then run `scontrol reconfigure` to distribute that change to the nodes.


Cheers
L.


-----------------------------------------------------------------------------------
This email message is for the sole use of the intended recipient(s) and may contain
confidential information. Any unauthorized review, use, disclosure or distribution
is prohibited. If you are not the intended recipient, please contact the sender by
reply email and destroy all copies of the original message.
-----------------------------------------------------------------------------------
