I have added nodes to an existing partition several times using the same
procedure you describe, and I have not noticed any bad side effects.
This is a perfectly normal operation in a cluster, where hardware is
added or retired from time to time while the cluster continues its
normal production work. We must be able to do this, especially when
migrating existing nodes into a new Slurm cluster.
Douglas Jacobsen explained very well why problems may arise. It seems
to me that this completely rigid nodelist bit mask in the network is a
Slurm design problem, and that it ought to be fixed.
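My understanding (which may well be wrong) is that the safest approach is
to only append new NodeName lines after the existing ones and to avoid
renaming or reordering nodes that already exist, so that the ordering the
node bitmaps are built from is preserved. A minimal slurm.conf fragment,
with purely hypothetical node names and sizes:

  # existing nodes -- leave these lines untouched and in the same order
  NodeName=node[01-04] CPUs=16 RealMemory=64000 State=UNKNOWN
  # new hardware appended at the end
  NodeName=node[05-06] CPUs=16 RealMemory=64000 State=UNKNOWN
  # extend the partition to cover the new nodes
  PartitionName=batch Nodes=node[01-06] Default=YES MaxTime=INFINITE State=UP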
Question: how can we pinpoint the problem more precisely in a bug report
to SchedMD (for support customers only :-)?
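If we do file one, my guess (not an official SchedMD checklist) is that
the useful things to attach would be something like:

  scontrol --version                  # exact Slurm version
  scontrol show config > config.txt   # the full running configuration
  scontrol setdebug debug2            # raise slurmctld log verbosity
  # ...reproduce the node addition, save slurmctld.log and slurm.conf...
  scontrol setdebug info              # restore the normal log level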
/Ole
On 10/22/2017 08:44 PM, JinSung Kang wrote:
I am having trouble adding new nodes to a Slurm cluster without
killing the jobs that are currently running.
Right now I do the following (a rough command sketch follows the list):
1. Update slurm.conf and add the new node to it
2. Copy the new slurm.conf to all the nodes
3. Restart slurmd on all nodes
4. Restart slurmctld
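For concreteness, a sketch of those four steps as shell commands (the
node names, the /etc/slurm/ path and the systemd unit names below are
placeholders for my actual setup; adjust as needed):

  # steps 1+2: edit slurm.conf on the head node, then push it everywhere
  for h in node01 node02 newnode01; do
      scp /etc/slurm/slurm.conf $h:/etc/slurm/slurm.conf
  done
  # step 3: restart slurmd on every compute node
  for h in node01 node02 newnode01; do
      ssh $h systemctl restart slurmd
  done
  # step 4: restart slurmctld on the controller
  systemctl restart slurmctld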
But when I restart slurmctld, all the jobs that were running are
requeued, with "Begin Time" shown as the reason for not running. The
newly added node works perfectly fine.
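For what it's worth, the reason can also be seen directly on the pending
jobs (standard squeue/scontrol fields, as far as I know):

  squeue -t PENDING -o "%.12i %.10T %r"   # reason column shows e.g. BeginTime
  scontrol show job <jobid> | grep -i Reason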
I've included the slurm.conf, and also the slurmctld.log output from
when I'm trying to add the new node.