Hi, I posted this a while back but didn’t get any responses. I prefer using `srun` to invoke commands on our cluster because it is far more convenient than writing `sbatch` wrappers for single-process jobs (no multiple steps). The problem is that if I submit too many `srun` jobs, the head node starts running out of socket resources (or possibly some other resource), I start getting timeouts, and some of the `srun` processes start using 100% CPU.
I’ve tried redirecting all I/O to prevent use of sockets, etc., but still see this problem. Can anyone suggest an alternative approach or fix? Ideally something that doesn’t require me to write shell wrappers, but also doesn’t keep a process running on the head node. Thanks, -Craig
