I think this error usually means that node cn7 has either the wrong /etc/hosts or the wrong /etc/slurm/slurm.conf.
E.g. try 'srun --nodelist=cn7 ping -c 1 cn7'

On Wed, May 29, 2019 at 6:00 AM Alexander Åhman <alexan...@ydesign.se> wrote:
> Hi,
> Have a very strange problem. The cluster has been working just fine
> until one node died and now I can't submit jobs to 2 of the nodes using
> srun from the login machine. Using sbatch works just fine and also if I
> use srun from the same host as slurmctld.
> All the other nodes works just fine as they always has, only 2 nodes are
> experiencing this problem. Very strange...
>
> Have checked network connectivity and DNS and that is OK. I can ping,
> ssh to all nodes just fine. All nodes are identical and using Slurm 18.08.
> Also tested to reboot the 2 nodes and slurmctld but still same problem.
>
> [alex@li1 ~]$ srun -w cn7 hostname
> srun: error: fwd_tree_thread: can't find address for host cn7, check
> slurm.conf
> srun: error: Task launch for 1088816.0 failed on node cn7: Can't find an
> address, check slurm.conf
> srun: error: Application launch failed: Can't find an address, check
> slurm.conf
> srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
> srun: error: Timed out waiting for job step to complete
>
> [alex@li1 ~]$ srun -w cn6 hostname
> cn6.ydesign.se
>
> What is this error "can't find address for host" about? Have searched
> the web but can't find any good information about what the problem is or
> what to do to resolve it.
>
> Any kind soul out there who knows what to do next?
>
> Regards,
> Alexander Åhman
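
If it helps, here is a rough sketch of how the two files mentioned at the top could be checked on cn7 against a working node such as cn6. The helper names (resolves, same_file) and the /tmp/slurm.conf.from-cn6 path are made up for illustration, and I am assuming getent and cmp are available on your nodes.

```shell
#!/bin/sh
# Hypothetical helper: does this machine resolve the given host name?
# getent consults /etc/hosts and DNS in the order set by nsswitch.conf.
resolves() {
    if getent hosts "$1" >/dev/null 2>&1; then
        echo "$1: resolves"
    else
        echo "$1: NOT resolvable"
        return 1
    fi
}

# Hypothetical helper: are two copies of a config file byte-identical?
same_file() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "DIFFERENT"
        return 1
    fi
}

# Possible usage on cn7 (the second path is an assumed copy fetched from cn6):
#   resolves cn7
#   resolves cn6
#   same_file /etc/slurm/slurm.conf /tmp/slurm.conf.from-cn6
#   scontrol show node cn7 | grep -i NodeAddr
```

If one of the names does not resolve on cn7, or its slurm.conf differs from the copy on a working node, that would be the first thing I would fix.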