Hi Tim,
I would also check if the slurmd daemon overwrites the user limits when a
job is launched by slurm.
Submit a job with "ulimit -a" and see what's set when the job is submitted
by slurm.
In other words, I would also check /proc/<slurmd pid>/limits and
see what limits the slurmd process has.
In
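For example, something along these lines (the node option is just an illustration):

  # limits as seen from inside a job step
  srun -N1 bash -c 'ulimit -a'

  # limits of the running slurmd on a compute node
  cat /proc/$(pgrep -o slurmd)/limits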
Hi Chansup and Trey,
Thanks, yes the slurmd init script does contain:
ulimit -n 1048576
ulimit -l unlimited
ulimit -s unlimited
However, I think we finally figured it out. I'm going to look like a fool when I
explain it. It's all been a wild goose chase; I didn't check the obvious.
We mount
Hi Moe and Antony,
Thanks for the link. On further thought, I think you're right in saying it's on
the Linux network side. Looking at our system, we have:
/proc/sys/fs/file-max: 2346778
/proc/sys/net/ipv4/tcp_max_syn_backlog: 2048
/proc/sys/net/core/somaxconn: 128
So we bumped somaxconn up to
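For reference, the change itself is just a sysctl setting; the value below is only an
illustration, not necessarily what we ended up using:

  # runtime change
  sysctl -w net.core.somaxconn=2048
  # make it persistent across reboots
  echo 'net.core.somaxconn = 2048' >> /etc/sysctl.conf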
Hi Tim,
I'm not sure if you've checked the "ulimit -n" value for the user who runs
the job.
In my experience, I had to bump up the limit much higher than the default
1024.
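One common way to raise it is /etc/security/limits.conf on the compute nodes (the
numbers below are just examples), and then check with "ulimit -n" from inside a job:

  *   soft   nofile   65536
  *   hard   nofile   1048576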
Just my 2 cents,
- Chansup
On Wed, Sep 23, 2015 at 9:54 AM, Timothy Brown wrote:
> Hi Moe
Hi Chansup,
Yes, that's way up there too:
node0202 ~$ ulimit -a
core file size (blocks, -c) 0
data seg size (kbytes, -d) unlimited
scheduling priority (-e) 0
file size (blocks, -f) unlimited
pending signals (-i) 185698
max locked
I've seen similar behaviour on another system about a year ago and it was
due to socket limits. We fixed it by implementing the high throughput
cluster suggestions.
Antony
On 22 Sep 2015 15:56, "Moe Jette" wrote:
>
> I suspect that you are hitting some Linux system limit,
I tried a couple more things this afternoon.
I ran a 250-node job (12 tasks per node), but before running srun I set
PMI_TIME=4000. This is the error I received:
size = 3000, rank = 2853
size = 3000, rank = 1853
size = 3000, rank = 2353
size = 3000, rank = 853
srun: error: timeout waiting for
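For the record, that run was roughly (the application name is just a placeholder):

  export PMI_TIME=4000
  srun -N 250 --ntasks-per-node=12 ./my_mpi_app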
Hi Ralph,
> On Sep 21, 2015, at 8:36 PM, Ralph Castain wrote:
>
>
> This sounds like something in Slurm - I don’t know how srun would know to
> emit a message if the app was failing to open a socket between its own procs.
>
> Try starting the OMPI job with “mpirun” instead
I suspect that you are hitting some Linux system limit, such as open
files, or socket backlog. For information on how to address, see:
http://slurm.schedmd.com/big_sys.html
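For a quick look at the usual suspects on a compute node (not an exhaustive list):

  cat /proc/sys/fs/file-max
  cat /proc/sys/net/ipv4/tcp_max_syn_backlog
  cat /proc/sys/net/core/somaxconn
  ss -s    # summary of socket usage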
Quoting Timothy Brown:
Hi Moe,
On Sep 21, 2015, at 10:02 PM, Moe Jette
Hi Moe,
> On Sep 21, 2015, at 10:02 PM, Moe Jette wrote:
>
>
> What version of Slurm?
We're currently running 14.11.7
> How many tasks/ranks in your job?
I've been trying 500 nodes with 12 tasks per node, giving a total of 6000.
Although after this failed I started
What version of Slurm?
How many tasks/ranks in your job?
Can you run a non-MPI job of the same size (i.e. srun hostname)?
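For the last one, something along these lines would do (node and task counts taken
from your description):

  srun -N 500 --ntasks-per-node=12 hostname | sort | uniq -c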
Quoting Ralph Castain:
This sounds like something in Slurm - I don’t know how srun would
know to emit a message if the app was failing to open a socket
Hi Chris,
> On Sep 21, 2015, at 7:36 PM, Christopher Samuel wrote:
>
>
> On 22/09/15 07:17, Timothy Brown wrote:
>
>> This is using mpiexec.hydra with slurm as the bootstrap.
>
> Have you tried Intel MPI's native PMI start up mode?
>
> You just need to set the
This sounds like something in Slurm - I don’t know how srun would know to emit
a message if the app was failing to open a socket between its own procs.
Try starting the OMPI job with “mpirun” instead of srun and see if it has the
same issue. If not, then that’s pretty convincing that it’s
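i.e. from inside the same allocation, something like (the application name is a
placeholder):

  mpirun -np 6000 ./my_mpi_app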
On 22/09/15 07:17, Timothy Brown wrote:
> This is using mpiexec.hydra with slurm as the bootstrap.
Have you tried Intel MPI's native PMI start up mode?
You just need to set the environment variable I_MPI_PMI_LIBRARY to the
path to the Slurm libpmi.so file and then you should be able to use
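The usual recipe is something like this (the libpmi.so path depends on where Slurm is
installed on your system, so treat it as a placeholder):

  export I_MPI_PMI_LIBRARY=/usr/lib64/libpmi.so
  srun -N 500 --ntasks-per-node=12 ./my_mpi_app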