Hi, 

I posted this a while back but didn’t get any responses.  I prefer using `srun` 
to invoke commands on our cluster because it is way more convenient than 
writing wrappers for sbatch for running single process jobs (no multiple 
steps).  The problem is that if I submit too many srun jobs, the head node 
starts running out of socket resources (or some other resource?) and I start getting timeouts 
and some of the srun processes start using 100% CPU.  

I’ve tried redirecting all I/O to prevent use of sockets, etc., but still see 
this problem.  Can anyone suggest an alternative approach or fix?  Ideally 
something that doesn’t require me to write shell wrappers, but also doesn’t 
keep a process running on the head node?

Thanks,
-Craig
  
