Hi all,
I figured out what caused the _slurm_connect failure. I was launching the
slurm controller (slurmctld) and the slurmd daemons on the compute nodes
with /etc/slurm/start_all.sh. That script started the slurmd daemons before
the controller and slurmdbd, so the slurmd processes had no controller to
connect to.

Once I changed the order of startup, the _slurm_connect error went away.
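
For anyone hitting the same thing, here is a minimal sketch of the startup
order that works: slurmdbd first, then slurmctld, then slurmd on each
compute node. This is not the actual contents of /etc/slurm/start_all.sh.
The systemd unit names, the use of ssh, the node names, and the DRY_RUN and
COMPUTE_NODES variables are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical sketch only; NOT the real /etc/slurm/start_all.sh.
# Assumptions: the daemons run as systemd units named slurmdbd,
# slurmctld and slurmd, and the compute nodes are reachable over ssh.

DRY_RUN="${DRY_RUN:-1}"                          # 1 = print commands, do not run them
COMPUTE_NODES="${COMPUTE_NODES:-node01 node02}"  # placeholder hostnames

# Run a command, or just print it when DRY_RUN is 1.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_all() {
    # 1. Accounting daemon first, so the controller can reach it.
    run systemctl start slurmdbd
    # 2. Controller next, so slurmd has something to connect to.
    run systemctl start slurmctld
    # 3. Compute-node daemons last.
    for node in $COMPUTE_NODES; do
        run ssh "$node" systemctl start slurmd
    done
}

start_all
```

With the defaults this only prints the commands in order. Setting DRY_RUN=0
would execute them, which assumes the units and ssh access above exist on
your cluster.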

I still have the error where adding the compute nodes to SLURM causes jobs
to fail.
