I have added nodes to an existing partition several times using the same
procedure you describe, and I have not noticed any bad side effects.
This is a perfectly normal operation in a cluster, where hardware is
added or retired from time to time while the cluster continues its
normal production work. We must be able to do this, especially when
migrating existing nodes into a new Slurm cluster.
Douglas Jacobsen explained very well why problems may arise. It seems
to me that this completely rigid nodelist bit mask in the network is a
Slurm design problem, and that it ought to be fixed.
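My understanding (which may well be wrong) is that the safest approach is
to only append new NodeName lines after the existing ones and to avoid
renaming or reordering nodes that already exist, so that the ordering the
node bitmaps are built from is preserved. A minimal slurm.conf fragment,
with purely hypothetical node names and sizes:

  # existing nodes -- leave these lines untouched and in the same order
  NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
  # new hardware appended at the end
  NodeName=node[05-06] CPUs=16 RealMemory=64000 State=UNKNOWN
  # extend the partition to cover the new nodes
  PartitionName=batch Nodes=node[01-06] Default=YES MaxTime=INFINITE State=UP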
Question: how can we pinpoint the problem more precisely in a bug report
to SchedMD (for support customers only :-)?
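If we do file one, my guess (not an official SchedMD checklist) is that
the useful things to attach would be something like:

  scontrol --version                  # exact Slurm version
  scontrol show config > config.txt   # the full running configuration
  scontrol setdebug debug2            # raise slurmctld log verbosity
  # ...reproduce the node addition, save slurmctld.log and slurm.conf...
  scontrol setdebug info              # restore the normal log level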
/Ole
On 10/22/2017 08:44 PM, JinSung Kang wrote:
I am having trouble adding new nodes to a Slurm cluster without
killing the jobs that are currently running.
Right now I do the following (a rough command sketch follows the list):
1. Update slurm.conf and add the new node to it
2. Copy the new slurm.conf to all the nodes
3. Restart slurmd on all nodes
4. Restart slurmctld
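For concreteness, a sketch of those four steps as shell commands (the
node names, the /etc/slurm/ path and the systemd unit names below are
placeholders for my actual setup; adjust as needed):

  # steps 1+2: edit slurm.conf on the head node, then push it everywhere
  for h in node01 node02 newnode01; do
      scp /etc/slurm/slurm.conf $h:/etc/slurm/slurm.conf
  done
  # step 3: restart slurmd on every compute node
  for h in node01 node02 newnode01; do
      ssh $h systemctl restart slurmd
  done
  # step 4: restart slurmctld on the controller
  systemctl restart slurmctld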
But when I restart slurmctld, all the jobs that were running are
requeued, with "Begin Time" shown as the reason for not running. The
newly added node works perfectly fine.
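For what it's worth, the reason can also be seen directly on the pending
jobs (standard squeue/scontrol fields, as far as I know):

  squeue -t PENDING -o "%.12i %.10T %r"   # reason column shows e.g. BeginTime
  scontrol show job <jobid> | grep -i Reason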
I've included the slurm.conf, and also the slurmctld.log output from
when I'm trying to add the new node.