Hi, I have some new hardware that I am hoping to move slurmctld to, and I am wondering whether I need to haul along the IP address or whether there is some other sneaky method.
Has anyone done this recently and been able to keep jobs from getting stuck in CG state?

I was testing this out on a dev cluster: I started some jobs, shut down slurmctld and slurmd, updated the config file, and started things back up. I then noticed squeue showing jobs in CG state, and upon logging in to the corresponding compute nodes I found that slurmstepd processes were still in the process list. An strace on them showed they were trying to contact the old controller to send the job-complete RPC.

Maybe I messed up the order of operations. Or maybe I really should haul along the IP.

Would slurmstepd fall back to the backup controller in the slurm.conf file it read when starting that job (if I had one defined)? For example, could I define the new IP as the backup, roll that out, wait until all jobs that were running before the change have completed, and then switch the primary and backup controllers in the slurm.conf file?

Is this up to date?

http://slurm.schedmd.com/faq.html#controller

*2. How should I relocate the primary or backup controller?*

If the cluster's computers used for the primary or backup controller will be out of service for an extended period of time, it may be desirable to relocate them. In order to do so, follow this procedure:

1. Stop all Slurm daemons
2. Modify the *ControlMachine*, *ControlAddr*, *BackupController*, and/or *BackupAddr* in the *slurm.conf* file
3. Distribute the updated *slurm.conf* file to all nodes
4. Copy the *StateSaveLocation* directory to the new host and make sure the permissions allow the *SlurmUser* to read and write it.
5. Restart all Slurm daemons

There should be no loss of any running or pending jobs. Insure that any nodes added to the cluster have a current *slurm.conf* file installed.

*CAUTION:* If two nodes are simultaneously configured as the primary controller (two nodes on which *ControlMachine* specify the local host and the *slurmctld* daemon is executing on each), system behavior will be destructive.
If a compute node has an incorrect *ControlMachine* or *BackupController* parameter, that node may be rendered unusable, but no other harm will result.
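
To make the backup-controller idea above concrete, here is roughly the slurm.conf change I have in mind. This is only a sketch of what I am asking about, not something I have verified works. The hostnames (old-ctl, new-ctl) and the 10.0.0.x addresses are made-up placeholders; only the four parameter names come from the FAQ. As far as I understand, a backup controller also needs access to the same StateSaveLocation as the primary, which this fragment does not address.

```
# Stage 1 (interim): the old controller stays primary, the new host is
# rolled out as the backup. Hostnames and addresses are placeholders.
ControlMachine=old-ctl
ControlAddr=10.0.0.10
BackupController=new-ctl
BackupAddr=10.0.0.20

# Stage 2 (once every job started before stage 1 has finished):
# swap the roles so the new host becomes primary.
#ControlMachine=new-ctl
#ControlAddr=10.0.0.20
#BackupController=old-ctl
#BackupAddr=10.0.0.10
```

The open question is whether a slurmstepd that was started under the stage 1 config would actually try new-ctl for the job-complete RPC once old-ctl stops answering.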
