Slurm 15.08.7 is really old; the current release is 17.02.7.
Still, if you read my Wiki page about Slurm configuration, perhaps you will discover the missing item: https://wiki.fysik.dtu.dk/niflheim/Slurm_configuration
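The "Error binding slurm stream socket: Cannot assign requested address" line in your log usually means slurmctld tried to bind before the network interface had its address, which would explain why a manual start later works fine. A minimal sketch of a systemd drop-in that delays the service until the network is actually online (assuming Ubuntu's packaged slurmctld unit; verify the target and service names on your system):

```shell
# Sketch: delay slurmctld until the network is up (assumes Ubuntu 16.04's
# packaged unit file at /lib/systemd/system/slurmctld.service)
sudo mkdir -p /etc/systemd/system/slurmctld.service.d
sudo tee /etc/systemd/system/slurmctld.service.d/wait-for-network.conf <<'EOF'
[Unit]
# Order after the network is actually online, not merely configured
Wants=network-online.target
After=network-online.target
EOF

# Make network-online.target meaningful; enable the one matching your setup
sudo systemctl enable systemd-networkd-wait-online.service   # if using systemd-networkd
# sudo systemctl enable NetworkManager-wait-online.service   # if using NetworkManager

sudo systemctl daemon-reload
```

If that doesn't help, also check that the controller name in slurm.conf resolves at boot to an address the head node actually has, e.g. with `getent hosts arcesius`.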

/Ole

On 09/19/2017 05:17 PM, Kyle Mills wrote:
Hi Ole,

I'm using Ubuntu 16.04 on each head/compute node, and have installed slurm-wlm from the apt repositories.  It is slurm 15.08.7.

On Tue, Sep 19, 2017 at 11:07 AM, Ole Holm Nielsen <ole.h.niel...@fysik.dtu.dk> wrote:


    If your OS is CentOS/RHEL 7, you may want to consult my Wiki page
    about setting up Slurm: https://wiki.fysik.dtu.dk/niflheim/SLURM.
    If you do things correctly, there should be no problems :-)
    If you do things correctly, there should be no problems :-)

    /Ole



    On 09/19/2017 05:02 PM, Kyle Mills wrote:

        Hello,

        I'm trying to get SLURM set up on a small cluster comprised of a
        head node and 4 compute nodes.  On the head node, I have run

        ```
        sudo systemctl enable slurmctld
        ```

        but after a reboot SLURM is not running and
        `sudo systemctl status slurmctld` returns:

        ```
        ● slurmctld.service - Slurm controller daemon
             Loaded: loaded (/lib/systemd/system/slurmctld.service; enabled; vendor preset: enabled)
             Active: failed (Result: exit-code) since Tue 2017-09-19 10:38:00 EDT; 9min ago
            Process: 1363 ExecStart=/usr/sbin/slurmctld $SLURMCTLD_OPTIONS (code=exited, status=0/SUCCESS)
           Main PID: 1395 (code=exited, status=1/FAILURE)

        Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 4 nodes
        Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information about 0 jobs
        Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered state of 0 reservations
        Sep 19 10:38:00 arcesius slurmctld[1395]: read_slurm_conf: backup_controller not specified.
        Sep 19 10:38:00 arcesius slurmctld[1395]: Running as primary controller
        Sep 19 10:38:00 arcesius slurmctld[1395]: Recovered information about 0 sicp jobs
        Sep 19 10:38:00 arcesius slurmctld[1395]: error: Error binding slurm stream socket: Cannot assign requested address
        Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Main process exited, code=exited, status=1/FAILURE
        Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Unit entered failed state.
        Sep 19 10:38:00 arcesius systemd[1]: slurmctld.service: Failed with result 'exit-code'.
        ```

        If I then run `sudo systemctl start slurmctld`, it starts up
        without any errors and my compute nodes can communicate with the
        server.  Launching `slurmctld -Dvvvvvv` works, and doesn't print
        anything that I deem concerning.

        Why would it work manually, but not automatically on boot?  If
        you need any more information, please let me know; I'm not sure
        what is necessary to diagnose this problem.
