On 2015-03-15 15:46, Daniel Letai wrote:
> Hi,
>
> Testing a new slurm cluster (14.11.4) on a 1k nodes cluster.
>
> Several things we've tried:
> Increase slurmctld threads (8 ports range)
> Increase munge threads (threads=10)
> Increase messageTimeout to 30
>
> We are using accounting (db on different server)
>
> Thanks for any help

Take a look at http://slurm.schedmd.com/high_throughput.html

For us, setting somaxconn to 4096 fixed the socket timeout issues ("sysctl net.core.somaxconn=4096"). Check with "netstat -s | grep LISTEN" for listen queue overflows: does the number increase, and if it does, does bumping somaxconn fix it?

Put a line like

net.core.somaxconn = 4096

in /etc/sysctl.conf if you want the setting to survive a reboot.

-- 
Janne Blomqvist, D.Sc. (Tech.), Scientific Computing Specialist
Aalto University School of Science, PHYS & NBE
+358503841576 || [email protected]
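
The "does the number increase" check above can be sketched as a small script. This is a minimal illustration, not part of the original advice: it assumes a Linux host where /proc/net/netstat exposes the ListenOverflows counter on its "TcpExt:" lines, and the function names and the 10-second default interval are hypothetical.

```shell
#!/bin/sh
# Sketch: report whether the TCP listen-queue overflow counter is growing.
# Assumption: Linux, where /proc/net/netstat holds a "TcpExt:" header line
# followed by a "TcpExt:" value line with matching columns.

# Print the named TcpExt counter from netstat-format text on stdin.
tcpext_counter() {
    awk -v name="$1" '
        $1 == "TcpExt:" && !header_seen {
            # Header line: remember which column carries the counter.
            for (i = 2; i <= NF; i++) if ($i == name) col = i
            header_seen = 1
            next
        }
        # Value line: print the remembered column, if the name was found.
        $1 == "TcpExt:" && col { print $col; exit }
    '
}

# Sample the counter twice, INTERVAL seconds apart (hypothetical default: 10).
watch_listen_overflows() {
    interval="${1:-10}"
    before=$(tcpext_counter ListenOverflows < /proc/net/netstat)
    sleep "$interval"
    after=$(tcpext_counter ListenOverflows < /proc/net/netstat)
    # Treat a missing counter as 0 so the arithmetic stays valid.
    before="${before:-0}"
    after="${after:-0}"
    echo "ListenOverflows: $before -> $after (delta $((after - before)))"
}
```

Running something like "watch_listen_overflows 30" on the slurmctld host while jobs are being submitted, before and after raising somaxconn, would show whether the delta drops to zero with the larger listen queue.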
