I'm not really in a position to check, since I'm not our cluster admin. I asked him, and he thought it might come down to high load on the client node at the time; we often run submission commands from our shared compute nodes, which can become overloaded because they aren't themselves managed by a scheduler. If it's *not* that, and it's something we really need to investigate, that would be good to know.
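In case it's useful context for the list, here's roughly the kind of recovery check that's been suggested in this thread: tag each submission with a unique job name, and if `sbatch` dies with a timeout, ask the controller whether the job landed anyway. This is just an illustrative sketch; the helper name and error handling are not Toil's actual code.

```python
import subprocess
import uuid

def submit_or_recover(script_path):
    # Tag the submission with a unique name so we can find the job
    # even if sbatch's reply is lost to a socket timeout.
    name = f"toil-{uuid.uuid4()}"
    result = subprocess.run(
        ["sbatch", "--parsable", f"--job-name={name}", script_path],
        capture_output=True, text=True,
    )
    if result.returncode == 0:
        # --parsable prints "jobid" or "jobid;cluster".
        return result.stdout.strip().split(";")[0]
    # sbatch reported failure, but the job may have been accepted anyway:
    # ask slurmctld whether a job with our unique name exists.
    check = subprocess.run(
        ["squeue", "--noheader", "--format=%i", f"--name={name}"],
        capture_output=True, text=True,
    )
    job_ids = check.stdout.split()
    if job_ids:
        return job_ids[0]  # the submission went through after all
    raise RuntimeError(
        f"sbatch failed and no job named {name} is queued: "
        f"{result.stderr.strip()}"
    )
```

One caveat: `squeue` only sees jobs that are still pending or running, so a job that started and finished before the check would need a follow-up `sacct --name=...` query on clusters with accounting enabled.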
On Mon, Feb 23, 2026 at 9:42 PM Christopher Samuel via slurm-users <[email protected]> wrote:

> On 2/17/26 12:56 pm, Adam Novak via slurm-users wrote:
>
> > I'm working on the Slurm integration in our Toil workflow runner
> > project. I'm having a problem where an `sbatch` command to submit a job
> > to Slurm can fail (with exit code 1 and message "sbatch: error: Batch
> > job submission failed: Socket timed out on send/recv operation", in my
> > case, but possibly in other ways), but the job can still actually have
> > been submitted, and can still execute.
>
> I know others have given ideas on working around this, but have you had
> a chance to dig into why this is happening for you? That sort of network
> timeout points to either the slurmctld being totally overwhelmed with
> RPCs, or wedged in I/O, or some odd network problem.
>
> Do you see anything diagnostic in the slurmctld logs when that's happening?
>
> All the best,
> Chris
> --
> Chris Samuel : http://www.csamuel.org/ : Philadelphia, PA, USA

--
Adam Novak (He/Him)
Senior Software Engineer
Computational Genomics Lab
UC Santa Cruz Genomics Institute
"Revealing life’s code."
Personal Feedback: https://forms.gle/UXZhZc123knF65Dw5
