So we ran into a situation where our master was under high load due to
a lot of jobs exiting and running all at once (basically high traffic).
A user tried to launch an srun interactive job during this period. It
actually scheduled and allocated resources. However, when it tried to
launch the connection, it timed out and dropped the job. As you might guess,
this can be frustrating, especially if you have been sitting in the queue
for a while.
Is there a way to prevent this behavior? We've already dialed up the
timeouts.
-Paul Edmon-