Hi everyone,

We had a networking issue between the Slurm daemon and the nodes for about 15 minutes. All jobs were requeued and resubmitted once we fixed the networking issue. The problem is that we had jobs that had been running for several weeks, and all of them started again from the beginning.
In the Slurm configuration file, there is only one parameter, "JobRequeue", which either terminates the jobs or resubmits them once the connection between the Slurm daemon and the nodes is back. Neither option is useful for us, because we wanted the jobs to keep running until the network was back. Is there a way to keep running jobs going through a networking issue between the Slurm daemon and the nodes?

Thank you.

Warm regards,
Teshome
