Correction: the job is actually still running, but squeue shows it as CONFIGURING.

# sacct
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
13186        mz626_il4+       d2d3      users        112    RUNNING      0:0 

# squeue -l 
Fri Jan 31 08:35:22 2014
             JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
             13186      d2d3 mz626_il   nobody CONFIGUR      41:19 UNLIMITED     28 delta[29-56]
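
For anyone who wants to watch for this condition, here is a rough sketch (not part of Slurm) that scans `squeue -l` rows and flags jobs whose displayed state is still CONFIGUR after some elapsed time. The function names and the 600-second threshold are my own assumptions. It also assumes each job row is on a single line, job names contain no spaces, and job IDs are plain integers.

```python
def parse_elapsed(text):
    """Convert a Slurm elapsed-time string ([D-][HH:]MM:SS) to seconds."""
    days = 0
    if "-" in text:
        day_part, text = text.split("-", 1)
        days = int(day_part)
    parts = [int(p) for p in text.split(":")]
    # Pad missing leading fields (e.g. "41:19" has no hours component).
    while len(parts) < 3:
        parts.insert(0, 0)
    hours, minutes, seconds = parts
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def stuck_configuring(rows, threshold_seconds=600):
    """Return (job_id, elapsed) for rows still CONFIGUR past the threshold.

    threshold_seconds is an arbitrary illustrative value, not a Slurm setting.
    """
    stuck = []
    for row in rows:
        fields = row.split()
        # Skip the date line, the header, and anything that is not a job row.
        if len(fields) < 9 or not fields[0].isdigit():
            continue
        job_id, state, elapsed = fields[0], fields[4], fields[5]
        # squeue -l truncates CONFIGURING to CONFIGUR in the STATE column.
        if state.startswith("CONFIGUR") and parse_elapsed(elapsed) > threshold_seconds:
            stuck.append((job_id, elapsed))
    return stuck


if __name__ == "__main__":
    sample = [
        "JOBID PARTITION NAME USER STATE TIME TIMELIMIT NODES NODELIST(REASON)",
        "13186 d2d3 mz626_il nobody CONFIGUR 41:19 UNLIMITED 28 delta[29-56]",
    ]
    print(stuck_configuring(sample))
```

In practice the rows would come from capturing the output of `squeue -l`; the sample row above is the one shown in this message.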


On Thu, 2014-01-30 at 16:26 -0800, Franco Broi wrote: 
> 
> A user submitted a job for which half the nodes were sleeping. Slurm woke
> the nodes and the job ran, but the state hasn't changed from CONFIGURING.
> 
> slurm 2.6.5
> 
> [2014-01-31T07:54:03.848] sched: update_job: releasing user hold for job_id 13186
> [2014-01-31T07:54:03.848] _slurm_rpc_update_job complete JobId=13186 uid=1345 usec=150
> [2014-01-31T07:54:03.849] debug:  sched: Running job scheduler
> [2014-01-31T07:54:03.880] sched: Allocate JobId=13186 NodeList=delta[29-56] #CPUs=112
> [2014-01-31T07:54:04.712] power_save: waking nodes delta[29-42]
> [2014-01-31T07:54:11.000] debug:  backfill: beginning
> [2014-01-31T07:54:11.000] debug:  backfill: no jobs to backfill
> [2014-01-31T07:54:58.015] debug:  sched: Running job scheduler
> [2014-01-31T07:55:45.025] debug:  Spawning ping agent for delta[29-42]
> [2014-01-31T07:55:51.026] error: Nodes delta[29-42] not responding
> [2014-01-31T07:55:58.027] debug:  sched: Running job scheduler
> [2014-01-31T07:56:25.950] Node delta37 rebooted 58 secs ago
> [2014-01-31T07:56:26.244] Node delta31 rebooted 58 secs ago
> [2014-01-31T07:56:26.284] Node delta35 rebooted 58 secs ago
> [2014-01-31T07:56:27.134] Node delta36 rebooted 58 secs ago
> [2014-01-31T07:56:27.811] Node delta30 rebooted 63 secs ago
> [2014-01-31T07:56:28.222] Node delta34 rebooted 64 secs ago
> [2014-01-31T07:56:28.313] Node delta38 rebooted 60 secs ago
> [2014-01-31T07:56:28.609] Node delta40 rebooted 63 secs ago
> [2014-01-31T07:56:29.175] Node delta33 rebooted 58 secs ago
> [2014-01-31T07:56:29.196] Node delta41 rebooted 60 secs ago
> [2014-01-31T07:56:29.287] Node delta42 rebooted 63 secs ago
> [2014-01-31T07:56:30.834] Node delta32 rebooted 63 secs ago
> [2014-01-31T07:56:33.651] Node delta29 rebooted 63 secs ago
> [2014-01-31T07:56:33.772] Node delta39 rebooted 64 secs ago
> [2014-01-31T07:57:35.000] debug:  backfill: beginning
> [2014-01-31T07:57:35.000] debug:  backfill: no jobs to backfill
> [2014-01-31T07:57:58.179] debug:  sched: Running job scheduler
> [2014-01-31T07:58:58.186] debug:  sched: Running job scheduler
> [2014-01-31T07:59:05.191] debug:  Spawning ping agent for delta[29-42]
> 
> 
> # squeue -l 
> Fri Jan 31 08:11:35 2014
>              JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
>              13186      d2d3 mz626_il   nobody CONFIGUR      17:32 UNLIMITED     28 delta[29-56]
> 
> [2014-01-31T08:15:22.499] _slurm_rpc_update_job complete JobId=13186 uid=1348 usec=140
> 
> Fri Jan 31 08:19:23 2014
>              JOBID PARTITION     NAME     USER    STATE       TIME TIMELIMIT  NODES NODELIST(REASON)
>              13186      d2d3 mz626_il   nobody CONFIGUR      25:20 UNLIMITED     28 delta[29-56]
> 
> 
