SLURM has an obnoxious behavior: if a host name cannot be resolved when you do a reconfigure, the master dies. This cost us roughly 6000 jobs, because we did a reconfigure and then both our master and our HA failover died. I understand that SLURM currently must be able to resolve all host names in order to run. However, this design is a poor fit for an environment in high flux, such as our own, and it provides little protection against fat-fingering a name.
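Until the behavior changes, a pre-flight check run before the reconfigure could catch bad names. The following is only a minimal sketch, not anything SLURM provides: it assumes the node names have already been expanded into a plain list (for example with `scontrol show hostnames`), and the function name `unresolvable_hosts` and its `resolve` parameter are hypothetical.

```python
import socket
from typing import Callable, Iterable, List


def unresolvable_hosts(
    hostnames: Iterable[str],
    resolve: Callable[[str], object] = socket.gethostbyname,
) -> List[str]:
    """Return the host names that fail to resolve, in input order.

    `resolve` is injectable (an assumption of this sketch) so the check
    can be exercised without touching real DNS.
    """
    failed = []
    for name in hostnames:
        try:
            resolve(name)
        except (OSError, UnicodeError):
            # socket.gaierror and socket.herror are subclasses of OSError;
            # UnicodeError covers names that cannot be encoded for lookup.
            failed.append(name)
    return failed
```

If the returned list is non-empty, the idea would be to fix those names first and only then run `scontrol reconfigure`.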
We would like SLURM to be changed so that it does one of the following:

1. SLURM warns that it cannot resolve a name during a reconfigure and then does not apply the reconfigure.
2. SLURM warns that it cannot resolve the name, ignores that name by treating it as down or misconfigured, and continues with the reconfigure.

Both of these are far safer than having the master simply fail; the last time this happened we lost a lot of jobs that were in flight.

Thanks.
-Paul Edmon-
