Hi,

I have some new hardware that I am hoping to move slurmctld to, and I am
wondering whether I need to carry the old IP address over to it or whether
there is some other sneaky method.

Has anyone done this recently and been able to keep jobs from getting stuck
in CG (completing) state?

I was testing this on a dev cluster: I started some jobs, shut down
slurmctld and slurmd, updated the config file, and started things back up.
I then noticed squeue showing jobs in CG state and, upon logging in to the
corresponding compute nodes, found slurmstepd processes still in the
process list. An strace on them showed they were trying to contact the old
controller to send the job-complete RPC. Maybe I messed up the order of
operations, or maybe I really should carry over the IP. Would slurmstepd
fall back to the backup controller in the slurm.conf file it read when
starting that job (if I had one defined)? For example, could I define the
new IP as the backup, roll that out, wait until all jobs that were running
prior to the change have completed, and then switch the primary and backup
controllers in the slurm.conf file?
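
For anyone wanting to check for the same symptom, here is a rough sketch of
how the stuck jobs could be listed per node. It assumes the input text came
from something like `squeue -h -t CG -o "%i %N"` (job ID, then node list);
the function name and the sample job IDs and node names are made up for
illustration.

```python
def parse_completing(squeue_output):
    """Map job ID -> node list string, one entry per non-empty line.

    Assumes each line looks like "<jobid> <nodelist>", e.g. the output of
    squeue -h -t CG -o "%i %N" (an assumption, not verified on every version).
    Job IDs are kept as strings because array jobs look like "123_4".
    """
    jobs = {}
    for line in squeue_output.splitlines():
        parts = line.split(None, 1)  # split on the first run of whitespace
        if not parts:
            continue  # skip blank lines
        job_id = parts[0]
        nodes = parts[1].strip() if len(parts) > 1 else ""
        jobs[job_id] = nodes
    return jobs


# Hypothetical sample output, not taken from a real cluster.
sample = "1001 node01\n1002 node[02-03]\n"
for job_id, nodes in parse_completing(sample).items():
    print(job_id, nodes)
```

The node list is left in its compressed form (e.g. node[02-03]); expanding
it would be a separate step.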


Is this FAQ entry up to date?
http://slurm.schedmd.com/faq.html#controller

*2. How should I relocate the primary or backup controller?*
If the cluster's computers used for the primary or backup controller will
be out of service for an extended period of time, it may be desirable to
relocate them. In order to do so, follow this procedure:

   1. Stop all Slurm daemons
   2. Modify the *ControlMachine*, *ControlAddr*, *BackupController*,
   and/or *BackupAddr* in the *slurm.conf* file
   3. Distribute the updated *slurm.conf* file to all nodes
   4. Copy the *StateSaveLocation* directory to the new host and make sure
   the permissions allow the *SlurmUser* to read and write it.
   5. Restart all Slurm daemons

There should be no loss of any running or pending jobs. Ensure that any
nodes added to the cluster have a current *slurm.conf* file installed.
*CAUTION:* If two nodes are simultaneously configured as the primary
controller (two nodes on which *ControlMachine* specifies the local host and
the *slurmctld* daemon is executing on each), system behavior will be
destructive. If a compute node has an incorrect *ControlMachine* or
*BackupController* parameter, that node may be rendered unusable, but no
other harm will result.
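
Step 2 of the procedure above could be scripted along these lines. This is
only a minimal sketch under my own assumptions: it rewrites the
ControlMachine and ControlAddr lines in the text of a slurm.conf, leaves
commented-out lines alone, and drops anything that followed the old value
on the same line. The function name, host name and addresses are
hypothetical; it does not stop daemons, distribute the file, or copy
StateSaveLocation.

```python
import re


def retarget_controller(conf_text, new_host, new_addr=None):
    """Return conf_text with ControlMachine (and optionally ControlAddr)
    pointing at the new controller. Other lines are returned unchanged."""

    def replace_key(text, key, value):
        # Match "Key=..." at the start of a line (leading spaces/tabs allowed).
        # Lines starting with "#" do not match, so comments are left alone.
        pattern = re.compile(
            r"^([ \t]*)" + re.escape(key) + r"[ \t]*=.*$",
            re.IGNORECASE | re.MULTILINE,
        )
        # A function is used as the replacement so the value is inserted literally.
        return pattern.sub(lambda m: m.group(1) + key + "=" + value, text)

    out = replace_key(conf_text, "ControlMachine", new_host)
    if new_addr is not None:
        out = replace_key(out, "ControlAddr", new_addr)
    return out


# Hypothetical example values, for illustration only.
old_conf = "ControlMachine=oldctl\nControlAddr=10.0.0.1\n#ControlMachine=ancient\n"
print(retarget_controller(old_conf, "newctl", "10.0.0.2"))
```

The same helper could be pointed at BackupController and BackupAddr in the
same way, if those are set.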
