> On 04/05/2017 03:59 PM, Loris Bennett wrote:
> > We are running 16.05.10-2 with power-saving. However, we have noticed a
> > problem recently when nodes are woken up in order to start a job. The
> > node will go from 'idle~' to, say, 'mixed#', but then the job will fail
> > and the node will be put in 'down*'. We have turned up the log level to
> > 'debug' with the DebugFlag 'Power', but this hasn't produced anything
> > relevant. The problem is, however, resolved if the node is rebooted.
> >
> > Thus, there seems to be some disturbance of the communication between
> > the slurmd on the woken node and the slurmctld on the administration
> > node. Does anyone have any idea what might be going on?
>
> We have seen something similar with Slurm 16.05.10.
>
> How many nodes are in your network? If there are more than about 400
> devices in the network, you must tune the kernel ARP cache of the slurmctld
> server, see
> https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration#configure-arp-cache-for-large-networks
There are only ~120 nodes in the network, and our current parameters are:

==> /proc/sys/net/ipv4/neigh/default/gc_thresh1 <==
2048

==> /proc/sys/net/ipv4/neigh/default/gc_thresh2 <==
4096

==> /proc/sys/net/ipv4/neigh/default/gc_thresh3 <==
8192

==> /proc/sys/net/core/somaxconn <==
128

==> /proc/sys/net/ipv4/neigh/default/gc_interval <==
30

==> /proc/sys/net/ipv4/neigh/default/gc_stale_time <==
60

Best regards,
Bernd Melchers
--
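For anyone who wants to collect the same values on their own slurmctld host, a small loop over the standard Linux procfs paths (the same ones quoted above) is a quick way to do it; this is just a convenience sketch, the file paths are the standard kernel locations and the values are of course host-specific:

```shell
#!/bin/sh
# Print the ARP cache and socket tuning parameters discussed in this thread.
# These are the standard Linux procfs paths; unreadable/missing files are skipped.
for f in /proc/sys/net/ipv4/neigh/default/gc_thresh1 \
         /proc/sys/net/ipv4/neigh/default/gc_thresh2 \
         /proc/sys/net/ipv4/neigh/default/gc_thresh3 \
         /proc/sys/net/ipv4/neigh/default/gc_interval \
         /proc/sys/net/ipv4/neigh/default/gc_stale_time \
         /proc/sys/net/core/somaxconn; do
    if [ -r "$f" ]; then
        # Strip the /proc/sys/ prefix so the output reads like a sysctl key path
        printf '%s = %s\n' "${f#/proc/sys/}" "$(cat "$f")"
    fi
done
```

To make larger values persistent across reboots, the corresponding keys (e.g. net.ipv4.neigh.default.gc_thresh1) can be set in /etc/sysctl.conf, which is the approach the wiki page linked above describes.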