> On 04/05/2017 03:59 PM, Loris Bennett wrote:
> > We are running 16.05.10-2 with power-saving. However, we have noticed a
> > problem recently when nodes are woken up in order to start a job. The
> > node will go from 'idle~' to, say, 'mixed#', but then the job will fail
> > and the node will be put in 'down*'. We have turned up the log level to
> > 'debug' with the DebugFlag 'Power', but this hasn't produced anything
> > relevant. The problem is, however, resolved if the node is rebooted.
> >
> > Thus, there seems to be some disturbance of the communication between
> > the slurmd on the woken node and the slurmctld on the administration
> > node. Does anyone have any idea what might be going on?
>
> We have seen something similar with Slurm 16.05.10.
>
> How many nodes are in your network? If there are more than about 400
> devices in the network, you must tune the kernel ARP cache of the slurmctld
> server, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
There are only ~120 nodes in the network, and our current parameters are:

==> /proc/sys/net/ipv4/neigh/default/gc_thresh1 <==
2048

==> /proc/sys/net/ipv4/neigh/default/gc_thresh2 <==
4096

==> /proc/sys/net/ipv4/neigh/default/gc_thresh3 <==
8192

==> /proc/sys/net/core/somaxconn <==
128

==> /proc/sys/net/ipv4/neigh/default/gc_interval <==
30

==> /proc/sys/net/ipv4/neigh/default/gc_stale_time <==
60

Best regards,
Bernd Melchers
--
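For anyone who wants to collect the same values on their own slurmctld host, a small loop over the standard Linux procfs paths (the same ones quoted above) is a quick way to do it; this is just a convenience sketch, the file paths are the standard kernel locations and the values are of course host-specific:

```shell
#!/bin/sh
# Print the ARP cache and socket tuning parameters discussed in this thread.
# These are the standard Linux procfs paths; unreadable/missing files are skipped.
for f in /proc/sys/net/ipv4/neigh/default/gc_thresh1 \
         /proc/sys/net/ipv4/neigh/default/gc_thresh2 \
         /proc/sys/net/ipv4/neigh/default/gc_thresh3 \
         /proc/sys/net/ipv4/neigh/default/gc_interval \
         /proc/sys/net/ipv4/neigh/default/gc_stale_time \
         /proc/sys/net/core/somaxconn; do
    if [ -r "$f" ]; then
        # Strip the /proc/sys/ prefix so the output reads like a sysctl key path
        printf '%s = %s\n' "${f#/proc/sys/}" "$(cat "$f")"
    fi
done
```

To make larger values persistent across reboots, the corresponding keys (e.g. net.ipv4.neigh.default.gc_thresh1) can be set in /etc/sysctl.conf, which is the approach the wiki page linked above describes.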