Occasionally, when we reset the master, some of our nodes go into an unknown state or take a while to get back in contact with the master. If srun is being launched on the nodes at that time, it tends to hang, which causes the mpirun that depends on that srun to fail. Even stranger, the sbatch that originally launched the srun keeps running rather than failing outright.

Is there a way to have srun simply wait until the master comes back instead of failing? Or is the timeout the only setting that controls this? If neither is possible, can we have the parent sbatch die with an error rather than leave srun hanging?
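
To make the last question concrete, here is a rough sketch of the fallback behavior I have in mind, done inside the batch script itself rather than by Slurm. It assumes GNU coreutils `timeout` is available on the nodes. The helper name `run_with_limit`, the 600-second limit, and `./my_mpi_program` are placeholders for illustration only, and I have not verified that a genuinely hung srun exits cleanly when it receives the signal from `timeout`.

```shell
#!/bin/bash
# Hypothetical helper (not part of Slurm): run a command under a time limit
# and pass its exit status back so the calling batch script can fail.
run_with_limit() {
    local limit="$1"   # time limit in seconds
    shift
    timeout "$limit" "$@"
    local rc=$?
    # GNU timeout exits with status 124 when the limit was reached.
    if [ "$rc" -eq 124 ]; then
        echo "step timed out after ${limit}s: $*" >&2
    fi
    return "$rc"
}

# In a real batch script the wrapped command would be the srun step, e.g.:
#   run_with_limit 600 srun ./my_mpi_program || exit 1
# (600 and ./my_mpi_program are placeholders.)
```

With the `|| exit 1` form, the batch script stops with a non-zero status as soon as the step times out or fails, instead of sitting behind a hung srun.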

Thanks for any insight.

-Paul Edmon-
