You should be able to start the daemons in any order. 
If the slurmctld is down, the slurmd will report an error connecting, but when
slurmctld starts, it should connect to the slurmd and all will be well.
I still suspect that you have some configuration problem with respect to the 
network.
________________________________________
From: [email protected] [[email protected]] On Behalf Of Paul Thirumalai [[email protected]]
Sent: Tuesday, February 15, 2011 11:14 AM
To: [email protected]
Subject: Re: [slurm-dev] sbatch seems to have stopped working

Hi all,

I figured out what caused the _slurm_connect failure. I was launching the
SLURM controller and the slurmd daemons on the compute nodes with
/etc/slurm/start_all.sh. That script started the slurmd daemons before the
SLURM controller (slurmctld) and slurmdbd, so the slurmd processes could not
connect to the controller.

Once I changed the order of startup, the _slurm_connect error went away.
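The start-up order that made the _slurm_connect error disappear can be written as a small dependency graph: slurmdbd first, then slurmctld, then slurmd. The following is a minimal sketch that only prints the order. It assumes that dependency chain, and the DEPENDENCIES map and start_order() helper are hypothetical, not part of SLURM or of start_all.sh. As noted in the reply, slurmd should reconnect on its own once slurmctld is up, so treat this ordering as a precaution, not a requirement.

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

# Hypothetical dependency map (assumption based on this thread): each daemon
# lists the daemons that should already be running before it is started.
DEPENDENCIES = {
    "slurmctld": {"slurmdbd"},  # controller talks to the accounting daemon
    "slurmd": {"slurmctld"},    # compute-node daemon registers with the controller
}


def start_order(deps=DEPENDENCIES):
    """Return daemon names so that each one comes after its dependencies."""
    return list(TopologicalSorter(deps).static_order())


if __name__ == "__main__":
    # Dry run only: print the order instead of launching anything.
    for name in start_order():
        print(name)
```

A real start script would also need to run slurmd on each compute node (for example over ssh) rather than on the controller host. This sketch does not attempt that.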

I still have the error where adding the compute nodes to SLURM causes jobs to 
fail.
