Hi everyone,

We had a networking issue between the Slurm daemon and the nodes that lasted 
about 15 minutes. All jobs were requeued and resubmitted once we fixed the 
networking issue. The problem is that some of these jobs had been running for 
several weeks, and they all started again from the beginning.

In the Slurm configuration file, there is only one relevant parameter, 
"JobRequeue", which either terminates the jobs or resubmits them once the 
connection between the Slurm daemon and the nodes is back. Neither option is 
useful for us, because we wanted the jobs to keep running through the outage 
until the network was restored.

Is there a way to keep running jobs going through a networking issue 
between the Slurm daemon and the nodes?

Thank you.

Warm regards,
Teshome
