Slurm 15.08.7 is really old; the current version is 17.02.7.
Still, if you read my Wiki page about Slurm configuration, perhaps
you will discover the missing item:
https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration
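For what it's worth, "Error binding slurm stream socket: Cannot assign
requested address" at boot, with a clean start later by hand, often
points to slurmctld coming up before the network interface that carries
the controller's address. If that is the cause here (only a guess from
the log), a systemd drop-in that orders the service after the network
may be the missing piece:
```
# /etc/systemd/system/slurmctld.service.d/override.conf
# (create it with: sudo systemctl edit slurmctld)
[Unit]
Wants=network-online.target
After=network-online.target
```
Note that network-online.target only actually waits if the matching
wait service for your network manager is enabled, e.g.
systemd-networkd-wait-online.service or NetworkManager-wait-online.service.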
/Ole
On 09/19/2017 05:17 PM, Kyle Mills wrote:
Hi Ole,
I'm using Ubuntu 16.04 on each head/compute node, and have installed
slurm-wlm from the apt repositories. It is slurm 15.08.7.
On Tue, Sep 19, 2017 at 11:07 AM, Ole Holm Nielsen
<ole.h.niel...@fysik.dtu.dk> wrote:
If your OS is CentOS/RHEL 7, you may want to consult my Wiki page
about setting up Slurm: https://wiki.fysik.dtu.dk/niflheim/SLURM.
If you do things correctly, there should be no problems :-)
/Ole
On 09/19/2017 05:02 PM, Kyle Mills wrote:
Hello,
I'm trying to get SLURM set up on a small cluster consisting of a
head node and 4 compute nodes. On the head node, I have run
```
sudo systemctl enable slurmctld
```
but after a reboot SLURM is not running and
`sudo systemctl status slurmctld` returns:
```
● slurmctld.service - Slurm controller daemon
   Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
   Active: failed (Result: exit-code) since Tue 2017-09-19 10:38:00 EDT; 9min ago
  Process: 1363 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
 Main PID: 1395 (code=exited, status=1/FAILURE)

Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 4 nodes
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information about 0 jobs
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 0 reservations
Sep 19 10:38:00 arcesius slurmctld[1395]: read_slurm_conf: backup_controller not specified.
Sep 19 10:38:00 arcesius slurmctld[1395]: Running as primary controller
Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information about 0 sicp jobs
Sep 19 10:38:00 arcesius slurmctld[1395]: error: Error binding slurm stream socket: Cannot assign requested address
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Unit entered failed state.
Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Failed with result 'exit-code'.
```
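The socket bind error seems to be the key line. In case it is useful,
once the controller has been started by hand, the address and port it
binds can be inspected (ControlMachine/ControlAddr and SlurmctldPort
are the slurm.conf settings involved, as far as I understand):
```
# Controller address/port as seen by the running config:
scontrol show config | grep -iE 'ControlMachine|ControlAddr|SlurmctldPort'

# Check that the controller hostname resolves to an address
# that is actually configured on this node:
getent hosts arcesius
ip addr show
```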
If I then run `sudo systemctl start slurmctld`, it starts up
without any errors and my compute nodes can communicate with the
server. Launching `slurmctld -Dvvvvvv` works, and doesn't print
anything that I deem concerning.
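To compare the failed boot with a manual start, I assume the ordering
can be pulled from the journal; something like this should show whether
the network came up only after slurmctld tried to bind (guessing at the
relevant units on Ubuntu 16.04):
```
# Boot-time messages from slurmctld and the network service, in order:
journalctl -b -u slurmctld -u networking -o short-monotonic
```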
Why would it work manually, but not automatically on boot? If
you need any more information, please let me know; I'm not sure
what is necessary to diagnose this problem.