Maybe this has something to do with the requirement that ports be open to use srun (if you don't have that port open, it won't work at all)? Perhaps there is some limit on each port, etc.?

________________________________________
From: Craig Yoshioka <[email protected]>
Sent: Wednesday, July 5, 2017 1:37:00 PM
To: slurm-dev
Subject: [slurm-dev] srun CPU use

Hi,

I posted this a while back but didn’t get any responses. I prefer using `srun` to invoke commands on our cluster because it is far more convenient than writing `sbatch` wrappers for single-process jobs (no multiple steps). The problem is that if I submit too many `srun` jobs, the head node starts running out of socket resources (or possibly some other resource), and I start getting timeouts while some of the `srun` processes start using 100% CPU. I’ve tried redirecting all I/O to prevent the use of sockets, etc., but I still see this problem.

Can anyone suggest an alternative approach or a fix? Something that doesn’t require me to write shell wrappers, but also doesn’t keep a process running on the head node?

Thanks,
-Craig
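
One option that may fit both constraints is `sbatch`'s `--wrap` option, which wraps a command string in a generated batch script, so no wrapper file is written and no client process stays on the head node after submission. The sketch below is only an illustration: `submit_single` and `DRY_RUN` are hypothetical names (not part of Slurm), and `--ntasks=1` assumes a single-process job. Unlike `srun`, output goes to the job's output file rather than the terminal.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of Slurm): submit one command as a batch job
# through sbatch --wrap, so no wrapper script file is needed and nothing keeps
# running on the head node once the job is queued.
# Set DRY_RUN=1 to print the sbatch command line instead of executing it.
submit_single() {
    # Join all arguments into the single command string that --wrap expects.
    local cmd="$*"
    local -a args=(sbatch --ntasks=1 --wrap="$cmd")

    if [[ "${DRY_RUN:-0}" == "1" ]]; then
        printf '%s\n' "${args[*]}"
    else
        "${args[@]}"
    fi
}
```

For example, `submit_single ./my_program input.dat` (a placeholder command) would queue the program and return immediately; setting `DRY_RUN=1` first shows the `sbatch` command without submitting anything.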
