On 2/17/26 12:56 pm, Adam Novak via slurm-users wrote:
> I'm working on the Slurm integration in our Toil workflow runner project. I'm having a problem where an `sbatch` command to submit a job to Slurm can fail (with exit code 1 and message "sbatch: error: Batch job submission failed: Socket timed out on send/recv operation", in my case, but possibly in other ways), but the job can still actually have been submitted, and can still execute.
I know others have given ideas on working around this, but have you had a chance to dig into why this is happening for you? That sort of network timeout points to the slurmctld being totally overwhelmed with RPCs, being wedged in I/O, or some odd network problem.
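
One quick way to get a feel for whether the controller is overloaded is to look at `sdiag` at the moment the timeouts occur. The following is only a minimal sketch, not a definitive diagnostic: the helper names `parse_sdiag` and `controller_snapshot` are made up for illustration, and the two field labels it looks for ("Server thread count" and "Agent queue size") are assumed to match what your Slurm version prints, so adjust them to your actual `sdiag` output.

```python
import subprocess


def parse_sdiag(text):
    """Collect the 'Key: value' lines of sdiag output into a dict of strings."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and value.strip():
            # Keep the first occurrence; later sections may reuse a label.
            stats.setdefault(key.strip(), value.strip())
    return stats


def controller_snapshot(run=subprocess.run):
    """Run sdiag once and return the fields of interest (None if absent)."""
    result = run(["sdiag"], capture_output=True, text=True, check=True)
    stats = parse_sdiag(result.stdout)
    # Assumed field labels; adjust to what your sdiag actually prints.
    wanted = ("Server thread count", "Agent queue size")
    return {name: stats.get(name) for name in wanted}
```

Calling `controller_snapshot()` a few times while `sbatch` is timing out, and comparing against a quiet period, gives a rough indication of whether RPC load is the issue. The `run` parameter exists only so the command runner can be swapped out, for example in a test.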
Do you see anything diagnostic in the slurmctld logs when that's happening?

All the best,
Chris
-- 
Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA
