SLURM has an obnoxious behavior: if a host name cannot be resolved 
when you do a reconfigure, the master dies. This cost us roughly 
6000 jobs, because we did a reconfigure and then both our master and 
our HA failover died.  I understand that SLURM currently must be able 
to resolve all the host names in order to run.  However, this design 
is a poor fit for an environment that is in high flux, such as our 
own, and it provides no protection against fat-fingering a name.

We would like the functionality of SLURM to be changed such that either:

1. SLURM gives a warning that it cannot resolve a name when it 
reconfigures and then does not do the reconfigure.
2. SLURM gives a warning that it cannot resolve the name, ignores 
that name (treating it as either down or misconfigured), and 
continues with the reconfigure.
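
To make the two options concrete, here is a rough sketch of the logic as a 
stand-alone Python pre-check. It is not SLURM code and uses no SLURM API. The 
function names, the `strict` flag, and the idea of passing in a plain list of 
node names are all assumptions made for illustration. The list of names would 
have to come from wherever the site keeps its node definitions.

```python
import socket
from typing import Callable, Iterable, List, Tuple


def default_resolver(name: str) -> bool:
    """Return True if the system resolver can resolve the host name."""
    try:
        socket.getaddrinfo(name, None)
        return True
    except (OSError, UnicodeError):
        # socket.gaierror is a subclass of OSError; malformed names
        # can raise UnicodeError during IDNA encoding.
        return False


def split_resolvable(
    names: Iterable[str],
    resolver: Callable[[str], bool] = default_resolver,
) -> Tuple[List[str], List[str]]:
    """Partition names into (resolvable, unresolvable), keeping order."""
    good: List[str] = []
    bad: List[str] = []
    for name in names:
        (good if resolver(name) else bad).append(name)
    return good, bad


def plan_reconfigure(
    names: Iterable[str],
    strict: bool,
    resolver: Callable[[str], bool] = default_resolver,
) -> Tuple[bool, List[str]]:
    """Hypothetical policy check, run before any reconfigure.

    strict=True  -> option 1: warn and refuse if any name is unresolvable.
    strict=False -> option 2: warn, drop the bad names, and proceed.
    Returns (proceed, names_to_use).
    """
    good, bad = split_resolvable(names, resolver)
    for name in bad:
        print(f"warning: cannot resolve host name {name!r}")
    if bad and strict:
        return False, []
    return True, good
```

In either mode the process doing the check keeps running. The worst outcome is 
a refused reconfigure or a node left out, not a dead master.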

Either of these would be far safer than having the master simply fail. 
The last time this happened we lost a lot of jobs that were in flight.

Thanks.

-Paul Edmon-
