So we ran into a situation where our master was under high load due to a lot of jobs exiting and running all at once (basically high traffic). A user tried to launch an interactive srun job during this period. It was actually scheduled and allocated resources. However, when it tried to launch the connection, it timed out and the job was dropped. As you might guess, this can be frustrating, especially if you have been sitting in the queue for a while.

Is there a way to prevent this behavior? We've already dialed up the timeouts.

-Paul Edmon-