On 2015-03-15 15:46, Daniel Letai wrote:
> Hi,
> 
> Testing a new slurm cluster (14.11.4) on a 1k nodes cluster.
> 
> Several things we've tried:
> Increase slurmctld threads (8 ports range)
> Increase munge threads (threads=10)
> Increase messageTimeout to 30
> 
> 
> We are using accounting (db on different server)
> 
> Thanks for any help

Take a look at

http://slurm.schedmd.com/high_throughput.html

For us, setting somaxconn to 4096 ("sysctl net.core.somaxconn=4096")
fixed the socket timeout issues. To see whether you are hitting the
same problem, check for listen queue overflows with
"netstat -s | grep LISTEN". Does the number increase over time, and if
it does, does bumping somaxconn fix it?
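
If you would rather watch the counters directly than eyeball the
netstat output, something like the following rough sketch should work
on Linux. It assumes the counters are exposed as ListenOverflows and
ListenDrops in the TcpExt section of /proc/net/netstat; the function
names and the 10 second interval are just my choices for illustration.

```python
#!/usr/bin/env python3
"""Sample the kernel's listen-queue counters twice and print the change."""
import time


def parse_netstat(text):
    """Parse /proc/net/netstat-style text into {prefix: {name: int}}."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    stats = {}
    # Lines come in pairs: a header line of names, then a line of values.
    for header, values in zip(lines[0::2], lines[1::2]):
        prefix, names = header.split(":", 1)
        _, nums = values.split(":", 1)
        stats[prefix] = dict(zip(names.split(), map(int, nums.split())))
    return stats


def listen_counters(text):
    """Pick out the two listen-queue counters (0 if a counter is absent)."""
    tcp = parse_netstat(text).get("TcpExt", {})
    return {k: tcp.get(k, 0) for k in ("ListenOverflows", "ListenDrops")}


def read_counters(path="/proc/net/netstat"):
    with open(path) as f:
        return listen_counters(f.read())


if __name__ == "__main__":
    before = read_counters()
    time.sleep(10)  # arbitrary sample interval, adjust to taste
    after = read_counters()
    for name in before:
        delta = after[name] - before[name]
        print(f"{name}: {before[name]} -> {after[name]} (+{delta})")
```

Run it on the slurmctld host while the cluster is busy. A delta that
keeps growing points at the listen queue.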

Put a line like

net.core.somaxconn = 4096

in /etc/sysctl.conf if you want the setting to survive a reboot.
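
To double-check after a reboot that the persisted value and the running
value agree, a small sketch along these lines could be used. It assumes
the setting lives in /etc/sysctl.conf itself (files under /etc/sysctl.d
are not looked at), and the helper names are made up for this example.

```python
#!/usr/bin/env python3
"""Compare net.core.somaxconn in sysctl.conf with the running kernel value."""


def parse_sysctl_conf(text):
    """Return {key: value} from sysctl.conf-style text, skipping comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(("#", ";")) or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def somaxconn_status(conf_text, running_value):
    """Return 'ok', 'mismatch', or 'not persisted'."""
    wanted = parse_sysctl_conf(conf_text).get("net.core.somaxconn")
    if wanted is None:
        return "not persisted"
    return "ok" if int(wanted) == int(running_value) else "mismatch"


if __name__ == "__main__":
    with open("/etc/sysctl.conf") as f:
        conf = f.read()
    with open("/proc/sys/net/core/somaxconn") as f:
        running = f.read()
    print(somaxconn_status(conf, running))
```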

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]
