Check the slurmd log file on the node.
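
If it isn't obvious where that log lives, the path is set by SlurmdLogFile in slurm.conf; something like this will find and follow it (the /var/log path below is just a guess at a common default):

scontrol show config | grep -i SlurmdLogFile
tail -n 100 /var/log/slurmd.log    # or wherever SlurmdLogFile points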

Ensure slurmd is still running.  It seems possible that the OOM killer or something similar is killing slurmd.
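
A quick way to check both (assuming systemd on CentOS 7):

systemctl status slurmd                       # is the daemon still up?
dmesg -T | grep -iE 'oom|killed process'      # any OOM kills logged?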

Brian Andrus

On 1/20/2020 1:12 PM, Dean Schulze wrote:
If I restart slurmd the asterisk goes away.  Then I can run a job once, the asterisk comes back, and the node remains in comp*:

[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1   idle liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
liqidos-dean-node1
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  comp* liqidos-dean-node1

I can get it back to idle* with scontrol:

[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update NodeName=liqidos-dean-node1 State=down Reason=none
[liqid@liqidos-dean-node1 ~]$ sudo /usr/local/bin/scontrol update NodeName=liqidos-dean-node1 State=resume
[liqid@liqidos-dean-node1 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
debug*       up   infinite      1  idle* liqidos-dean-node1

I'm beginning to wonder if I got some bad code from github.
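
Is there a good way to confirm both machines are actually running the same build, beyond scontrol --version on the controller and slurmd -V on the node?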


On Mon, Jan 20, 2020 at 1:50 PM Carlos Fenoy <mini...@gmail.com> wrote:

    Hi,

    The * next to the idle status in sinfo means that the node is
    unreachable/not responding. Check the status of the slurmd on the
    node and check the connectivity from the slurmctld host to the
    compute node (telnet may be enough). You can also check the
    slurmctld logs for more information.
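
    For example, from the slurmctld host (assuming the default
    SlurmdPort of 6818; you can confirm the port with "scontrol show
    config | grep SlurmdPort"):

    telnet liqidos-dean-node1 6818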

    Regards,
    Carlos

    On Mon, 20 Jan 2020 at 21:04, Dean Schulze
    <dean.w.schu...@gmail.com> wrote:

        I've got a node running on CentOS 7.7, built from the recent
        20.02.0pre1 code base.  Its behavior is strange, to say the
        least.

        The controller was built from the same code base, but on
        Ubuntu 19.10.  The controller reports the node's state with
        sinfo, but can't run a simple job with srun because it thinks
        the node isn't available, even when it is idle.  (And squeue
        shows an empty queue.)

        On the controller:
        $ srun -N 1 hostname
        srun: Required node not available (down, drained or reserved)
        srun: job 30 queued and waiting for resources
        ^Csrun: Job allocation 30 has been revoked
        srun: Force Terminated job 30
        $ sinfo
        PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        debug*       up   infinite      1  idle* liqidos-dean-node1
        $ squeue
                     JOBID  PARTITION      USER  ST  TIME   NODES  NODELIST(REASON)


        When I try to run the simple job on the node I get:

        [liqid@liqidos-dean-node1 ~]$ sinfo
        PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        debug*       up   infinite      1  idle* liqidos-dean-node1
        [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
        srun: Required node not available (down, drained or reserved)
        srun: job 27 queued and waiting for resources
        ^Csrun: Job allocation 27 has been revoked
        [liqid@liqidos-dean-node1 ~]$ squeue
                     JOBID  PARTITION      USER  ST  TIME   NODES  NODELIST(REASON)
        [liqid@liqidos-dean-node1 ~]$ srun -N 1 hostname
        srun: Required node not available (down, drained or reserved)
        srun: job 28 queued and waiting for resources
        ^Csrun: Job allocation 28 has been revoked
        [liqid@liqidos-dean-node1 ~]$ sinfo
        PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
        debug*       up   infinite      1  idle* liqidos-dean-node1

        Apparently Slurm thinks there are a bunch of jobs queued, but
        squeue shows an empty queue.  How do I get rid of them?

        If these zombie jobs aren't the problem, what else could be
        keeping this from running?
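
        (Is there a way to list jobs in every state, e.g. with squeue
        --states=all, to confirm whether anything is actually still
        queued somewhere?)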

        Thanks.

--
    Carles Fenoy
