You should be able to start the daemons in any order. If the slurmctld is down, the slurmd will report an error connecting, but when slurmctld starts, it should connect to the slurmd and all will be well. I still suspect that you have some configuration problem with respect to the network.

________________________________________
From: [email protected] [[email protected]] On Behalf Of Paul Thirumalai [[email protected]]
Sent: Tuesday, February 15, 2011 11:14 AM
To: [email protected]
Subject: Re: [slurm-dev] sbatch seems to have stopped working
Hi All,

I figured out the issue that caused the _slurm_connect failure. I was launching the slurm controller and the slurm daemons on the compute nodes using /etc/slurm/start_all.sh. This script was starting the slurmd daemons before the slurm controller and slurmdbd, which is why the slurmd processes could not connect to the controller. Once I changed the startup order, the _slurm_connect error went away.

I still have the error where adding the compute nodes to SLURM causes jobs to fail.
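The reordered startup described above could be sketched roughly as follows. This is only an illustration, not the contents of /etc/slurm/start_all.sh, which are not shown in this thread. The node names, the use of ssh to reach the compute nodes, and the helper names `startup_plan` and `start_all` are all assumptions made for the example. It defaults to a dry run that only prints the commands.

```python
import subprocess

# Order reported to work in the message above:
# accounting daemon first, then the controller, then the node daemons.
CONTROL_DAEMONS = ["slurmdbd", "slurmctld"]


def startup_plan(compute_nodes):
    """Return the commands to run, in order, as argv lists."""
    plan = [[daemon] for daemon in CONTROL_DAEMONS]
    # Assumption: slurmd is started on each compute node over ssh.
    plan += [["ssh", node, "slurmd"] for node in compute_nodes]
    return plan


def start_all(compute_nodes, dry_run=True):
    """Print the plan (dry run) or execute it, stopping at the first failure."""
    for argv in startup_plan(compute_nodes):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Hypothetical node names, for illustration only.
    start_all(["node01", "node02"])
```

Keeping the plan as data separate from its execution makes the order easy to check before anything is actually started.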
