Correction: the job is actually still running, but squeue shows its state as CONFIGURING.
# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
13186        mz626_il4+       d2d3      users        112    RUNNING      0:0
# squeue -l
Fri Jan 31 08:35:22 2014
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
  13186      d2d3 mz626_il   nobody CONFIGUR      41:19 UNLIMITED     28 delta[29-56]
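
For anyone who wants to catch this condition automatically, here is a minimal Python sketch that scans `squeue -l` style text for jobs that have sat in the CONFIGURING state longer than some limit. It is not part of Slurm: the function names `elapsed_seconds` and `stuck_configuring` and the 600-second default threshold are my own assumptions, and it relies on the default `squeue -l` column order shown above (job ID first, state fifth, elapsed time sixth).

```python
def elapsed_seconds(text):
    """Convert a squeue TIME value ([days-][hours:]minutes:seconds) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    # Pad missing leading fields, e.g. "41:19" becomes [0, 41, 19].
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def stuck_configuring(squeue_text, threshold_seconds=600):
    """Return (job_id, elapsed) for jobs in CONFIGURING past the threshold.

    threshold_seconds is a hypothetical default; pick a value that suits
    how long your nodes normally take to wake up.
    """
    stuck = []
    for line in squeue_text.splitlines():
        fields = line.split()
        # Job rows start with a numeric job ID and have at least six columns;
        # the date line, the header, and wrapped continuation lines do not.
        if len(fields) < 6 or not fields[0].isdigit():
            continue
        job_id, state, elapsed = fields[0], fields[4], fields[5]
        # squeue -l truncates the state name to the column width ("CONFIGUR").
        if state.startswith("CONFIGUR") and elapsed_seconds(elapsed) >= threshold_seconds:
            stuck.append((job_id, elapsed))
    return stuck


SAMPLE = """\
Fri Jan 31 08:35:22 2014
  JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
  13186      d2d3 mz626_il   nobody CONFIGUR      41:19 UNLIMITED     28 delta[29-56]
"""

print(stuck_configuring(SAMPLE))  # [('13186', '41:19')]
```

In practice the text would come from running `squeue -l` (for example through `subprocess.run`) rather than from a pasted sample. Rows that a mail client has wrapped onto two lines are still recognised, because only the first six columns are read.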
On Thu, 2014-01-30 at 16:26 -0800, Franco Broi wrote:
>
> A user submitted a job for which half the nodes were sleeping. Slurm woke
> the nodes and the job ran, but the state hasn't changed from configuring.
>
> slurm 2.6.5
>
> [2014-01-31T07:54:03.848] sched: update_job: releasing user hold for job_id 13186
> [2014-01-31T07:54:03.848] _slurm_rpc_update_job complete JobId=13186 uid=1345 usec=150
> [2014-01-31T07:54:03.849] debug: sched: Running job scheduler
> [2014-01-31T07:54:03.880] sched: Allocate JobId=13186 NodeList=delta[29-56] #CPUs=112
> [2014-01-31T07:54:04.712] power_save: waking nodes delta[29-42]
> [2014-01-31T07:54:11.000] debug: backfill: beginning
> [2014-01-31T07:54:11.000] debug: backfill: no jobs to backfill
> [2014-01-31T07:54:58.015] debug: sched: Running job scheduler
> [2014-01-31T07:55:45.025] debug: Spawning ping agent for delta[29-42]
> [2014-01-31T07:55:51.026] error: Nodes delta[29-42] not responding
> [2014-01-31T07:55:58.027] debug: sched: Running job scheduler
> [2014-01-31T07:56:25.950] Node delta37 rebooted 58 secs ago
> [2014-01-31T07:56:26.244] Node delta31 rebooted 58 secs ago
> [2014-01-31T07:56:26.284] Node delta35 rebooted 58 secs ago
> [2014-01-31T07:56:27.134] Node delta36 rebooted 58 secs ago
> [2014-01-31T07:56:27.811] Node delta30 rebooted 63 secs ago
> [2014-01-31T07:56:28.222] Node delta34 rebooted 64 secs ago
> [2014-01-31T07:56:28.313] Node delta38 rebooted 60 secs ago
> [2014-01-31T07:56:28.609] Node delta40 rebooted 63 secs ago
> [2014-01-31T07:56:29.175] Node delta33 rebooted 58 secs ago
> [2014-01-31T07:56:29.196] Node delta41 rebooted 60 secs ago
> [2014-01-31T07:56:29.287] Node delta42 rebooted 63 secs ago
> [2014-01-31T07:56:30.834] Node delta32 rebooted 63 secs ago
> [2014-01-31T07:56:33.651] Node delta29 rebooted 63 secs ago
> [2014-01-31T07:56:33.772] Node delta39 rebooted 64 secs ago
> [2014-01-31T07:57:35.000] debug: backfill: beginning
> [2014-01-31T07:57:35.000] debug: backfill: no jobs to backfill
> [2014-01-31T07:57:58.179] debug: sched: Running job scheduler
> [2014-01-31T07:58:58.186] debug: sched: Running job scheduler
> [2014-01-31T07:59:05.191] debug: Spawning ping agent for delta[29-42]
>
>
> # squeue -l
> Fri Jan 31 08:11:35 2014
>   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
>   13186      d2d3 mz626_il   nobody CONFIGUR      17:32 UNLIMITED     28 delta[29-56]
>
> [2014-01-31T08:15:22.499] _slurm_rpc_update_job complete JobId=13186 uid=1348 usec=140
>
> Fri Jan 31 08:19:23 2014
>   JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
>   13186      d2d3 mz626_il   nobody CONFIGUR      25:20 UNLIMITED     28 delta[29-56]
>
>