[slurm-dev] Re: Large job socket timed out errors.

2015-09-24 Thread CB
Hi Tim, I would also check if the slurmd daemon overwrites the user limits when a job is launched by slurm. Submit a job with "ulimit -a" and see what's set when the job is submitted by slurm. In other words, I would also check the /proc//limits and see what limits the slurmd process has. In
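A minimal sketch of both checks (the one-task srun and the PID lookup via pgrep are only illustrative; adapt to your setup):

    # Limits as seen by a job step launched through Slurm
    srun -N1 -n1 bash -c 'ulimit -a'

    # Limits of the running slurmd daemon itself
    cat /proc/$(pgrep -x slurmd)/limits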

[slurm-dev] Re: Large job socket timed out errors.

2015-09-24 Thread Timothy Brown
Hi Chansup and Trey, Thanks, yes the slurmd init script does contain: ulimit -n 1048576; ulimit -l unlimited; ulimit -s unlimited. However, I think we finally figured it out. I'm going to look like a fool when I explain it. It's all been a wild goose chase. I didn't check the obvious. We mount
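For reference, the preamble of such an init script typically looks something like the sketch below; the values mirror what is quoted above, but the daemon path is only an assumption:

    #!/bin/sh
    # Raise limits before starting slurmd so the daemon, and the job
    # steps it launches, inherit them.
    ulimit -n 1048576     # max open file descriptors
    ulimit -l unlimited   # max locked memory
    ulimit -s unlimited   # stack size
    exec /usr/sbin/slurmd "$@"   # path is illustrative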

[slurm-dev] Re: Large job socket timed out errors.

2015-09-23 Thread Timothy Brown
Hi Moe and Antony, Thanks for the link. On further thinking, I think you're right in saying it's on the Linux network side. In looking at our system we have: /proc/sys/fs/file-max: 2346778 /proc/sys/net/ipv4/tcp_max_syn_backlog: 2048 /proc/sys/net/core/somaxconn: 128 So we bumped somaxconn up to
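For anyone following along, a hedged sketch of raising those two backlog knobs, both at runtime and persistently; the values and the drop-in file name are illustrative, not the exact numbers used here:

    # Takes effect immediately
    sysctl -w net.core.somaxconn=4096
    sysctl -w net.ipv4.tcp_max_syn_backlog=4096

    # Persist across reboots (file name is an example)
    echo 'net.core.somaxconn = 4096'           >> /etc/sysctl.d/90-slurm-net.conf
    echo 'net.ipv4.tcp_max_syn_backlog = 4096' >> /etc/sysctl.d/90-slurm-net.conf
    sysctl --system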

[slurm-dev] Re: Large job socket timed out errors.

2015-09-23 Thread CB
Hi Tim, I'm not sure if you've checked the "ulimit -n" value for the user who runs the job. In my experience, I had to bump up the limit much higher than the default 1024. Just my 2 cents, - Chansup On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown wrote: > Hi Moe
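One common way to raise the per-user open-file limit above the 1024 default is pam_limits; a sketch, with illustrative file name and values:

    # /etc/security/limits.d/90-nofile.conf (read by pam_limits)
    *    soft    nofile    65536
    *    hard    nofile    131072

    # Verify from a fresh login shell, and from inside a job step:
    ulimit -n
    srun -N1 -n1 bash -c 'ulimit -n'

Note that steps launched by slurmd usually inherit the daemon's limits rather than the login limits, which is why the slurmd init script gets checked elsewhere in this thread.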

[slurm-dev] Re: Large job socket timed out errors.

2015-09-23 Thread Timothy Brown
Hi Chansup, Yes, that's way up there too: node0202 ~$ ulimit -a core file size (blocks, -c) 0 data seg size (kbytes, -d) unlimited scheduling priority (-e) 0 file size (blocks, -f) unlimited pending signals (-i) 185698 max locked

[slurm-dev] Re: Large job socket timed out errors.

2015-09-23 Thread Antony Cleave
I've seen similar behaviour on another system about a year ago and it was due to socket limits. We fixed it by implementing the high throughput cluster suggestions. Antony On 22 Sep 2015 15:56, "Moe Jette" wrote: > > I suspect that you are hitting some Linux system limit,
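The high throughput guidance largely comes down to a few slurm.conf parameters plus the kernel tunables discussed above; a rough sketch of the slurm.conf side (values are illustrative only, see the SchedMD docs for what suits your site):

    # slurm.conf excerpts (illustrative values)
    MessageTimeout=30         # tolerate slower RPC round trips on large jobs
    TreeWidth=64              # widen the slurmd communication fan-out
    SlurmctldPort=6817-6818   # a port range spreads incoming connections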

[slurm-dev] Re: Large job socket timed out errors.

2015-09-23 Thread Timothy Brown
I tried a couple more things this afternoon. A 250-node job (12 tasks per node); however, before running srun, I set PMI_TIME=4000. This is the error I received: size = 3000, rank = 2853 size = 3000, rank = 1853 size = 3000, rank = 2353 size = 3000, rank = 853 srun: error: timeout waiting for
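For anyone reproducing this, the launch looked roughly like the following; the application name is a placeholder, and PMI_TIME simply spreads the PMI key-value exchanges out over a longer window:

    salloc -N 250 --ntasks-per-node=12
    export PMI_TIME=4000       # default is 500 (microseconds)
    srun ./mpi_app             # placeholder application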

[slurm-dev] Re: Large job socket timed out errors.

2015-09-22 Thread Timothy Brown
Hi Ralph, > On Sep 21, 2015, at 8:36 PM, Ralph Castain wrote: > > > This sounds like something in Slurm - I don’t know how srun would know to > emit a message if the app was failing to open a socket between its own procs. > > Try starting the OMPI job with “mpirun” instead

[slurm-dev] Re: Large job socket timed out errors.

2015-09-22 Thread Moe Jette
I suspect that you are hitting some Linux system limit, such as open files or socket backlog. For information on how to address this, see: http://slurm.schedmd.com/big_sys.html Quoting Timothy Brown: Hi Moe, On Sep 21, 2015, at 10:02 PM, Moe Jette

[slurm-dev] Re: Large job socket timed out errors.

2015-09-22 Thread Timothy Brown
Hi Moe, > On Sep 21, 2015, at 10:02 PM, Moe Jette wrote: > > > What version of Slurm? We're currently running 14.11.7. > How many tasks/ranks in your job? I've been trying 500 nodes with 12 tasks per node, giving a total of 6000. Although after this failed I started
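In batch-script terms the failing case is roughly (script and application names are placeholders):

    #!/bin/bash
    #SBATCH --nodes=500
    #SBATCH --ntasks-per-node=12    # 6000 ranks total
    srun ./mpi_app                  # placeholder application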

[slurm-dev] Re: Large job socket timed out errors.

2015-09-21 Thread Moe Jette
What version of Slurm? How many tasks/ranks in your job? Can you run a non-MPI job of the same size (i.e. srun hostname)? Quoting Ralph Castain: This sounds like something in Slurm - I don’t know how srun would know to emit a message if the app was failing to open a socket
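That non-MPI sanity check is a one-liner at the same scale (geometry taken from earlier in the thread):

    srun -N 500 --ntasks-per-node=12 hostname | sort | uniq -c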

[slurm-dev] Re: Large job socket timed out errors.

2015-09-21 Thread Timothy Brown
Hi Chris, > On Sep 21, 2015, at 7:36 PM, Christopher Samuel wrote: > > > On 22/09/15 07:17, Timothy Brown wrote: > >> This is using mpiexec.hydra with slurm as the bootstrap. > > Have you tried Intel MPI's native PMI start up mode? > > You just need to set the

[slurm-dev] Re: Large job socket timed out errors.

2015-09-21 Thread Ralph Castain
This sounds like something in Slurm - I don’t know how srun would know to emit a message if the app was failing to open a socket between its own procs. Try starting the OMPI job with “mpirun” instead of srun and see if it has the same issue. If not, then that’s pretty convincing that it’s
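A sketch of the comparison being suggested, run inside the same allocation; the application name is a placeholder:

    salloc -N 250 --ntasks-per-node=12
    srun ./mpi_app               # Slurm's launcher, wires up via PMI
    mpirun -np 3000 ./mpi_app    # Open MPI's own launcher and daemons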

[slurm-dev] Re: Large job socket timed out errors.

2015-09-21 Thread Christopher Samuel
On 22/09/15 07:17, Timothy Brown wrote: > This is using mpiexec.hydra with slurm as the bootstrap. Have you tried Intel MPI's native PMI start up mode? You just need to set the environment variable I_MPI_PMI_LIBRARY to the path to the Slurm libpmi.so file and then you should be able to use
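For example (the libpmi.so path varies with the distribution and Slurm install; /usr/lib64/libpmi.so is only a common location):

    export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
    srun -n 6000 ./intel_mpi_app   # launch Intel MPI ranks directly with srun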