[slurm-dev] Re: slurmctld lockups after job requeue

Stuart Rankin Wed, 26 Mar 2014 12:06:42 -0700


Hi Danny

Unknown at the moment. Updating to rc1 will be part of the next planned maintenance. Right now we are seeing this problem about twice a month with pre5 on 700 nodes.


Regards

Stuart

On 26/03/14 18:55, Danny Auble wrote:

Does this happen with rc1?  Much has changed since pre5.

On 03/26/14 11:51, Stuart Rankin wrote:


Hi,

I am still using 14.03.0-0pre5 on Scientific Linux 6/RHEL6.

We have occasionally seen a nasty problem whereby slurmctld stops responding, and the number of threads just keeps increasing until it hits the limit. Despite our best efforts using all the usual tricks, we have been unable to get a core dump in this situation (e.g. scontrol abort does nothing, and neither does killall -ABRT; the only way to recover is to killall -9 slurmctld). We can get core dumps using these steps before the problem appears.
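
One way to notice the thread build-up before the limit is reached is to poll the daemon's thread count. The following is only a minimal sketch, assuming a Linux host where /proc/<pid>/status carries a "Threads:" field; the helper names and the alert limit of 200 are hypothetical and are not Slurm settings.

```python
import re
from pathlib import Path


def parse_thread_count(status_text):
    """Return the integer from the 'Threads:' line of /proc/<pid>/status text."""
    match = re.search(r"^Threads:\s+(\d+)\s*$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no Threads: line found")
    return int(match.group(1))


def thread_count(pid):
    """Read the current thread count of a running process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


def over_limit(count, limit=200):
    """True once the count passes the alert limit (hypothetical value)."""
    return count > limit
```

Calling thread_count() on the slurmctld pid from a cron job or monitoring check, and raising an alert when over_limit() is true, would leave time to attempt a core dump while the daemon still reacts to signals.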

However the trigger appears to be a job which is requeued (e.g. due to apparent node failure), and restarted on another node, but with processes and an active slurmd remaining on the original node. The slurmd on the original node eventually contacts slurmctld producing an error message like the following:

[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request

At this point, the problem occurs - threads begin to accumulate, but no client commands receive a response. On each occasion on which we have seen this issue, there has been a similar message in the slurmctld.log.
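
Since the same message has preceded every lockup, the log could be scanned for it automatically. This is a rough sketch rather than a Slurm tool: the function name is hypothetical, and the pattern assumes each log entry sits on a single line with exactly the wording quoted above.

```python
import re

# Matches the "sent from wrong node" error and captures its fields.
WRONG_NODE = re.compile(
    r"^\[(?P<time>[^\]]+)\] error: Batch completion for job (?P<job>\d+) "
    r"sent from wrong node \((?P<sender>[^\s)]+) rather than "
    r"(?P<expected>[^\s)]+)\), ignored request"
)


def find_wrong_node_completions(lines):
    """Return (timestamp, job id, sending node, expected node) per match."""
    hits = []
    for line in lines:
        m = WRONG_NODE.match(line.strip())
        if m:
            hits.append((m["time"], int(m["job"]), m["sender"], m["expected"]))
    return hits
```

Feeding it an open slurmctld.log file object yields one tuple per occurrence, which identifies the node whose stale slurmd needs restarting.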

The history of the job above from the slurmctld.log is included below. To recover I restarted slurmd on the original node sand-6-29, did kill -9 slurmctld and restarted it.

Any thoughts or suggestions would be gratefully received.

Best regards

Stuart

[2014-03-26T09:39:31.013] backfill: Started JobId=175049 on sand-6-29
[2014-03-26T09:39:34.316] debug:  _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-6-29 usec=2
[2014-03-26T09:39:34.318] debug:  Configuration for job 175049 complete
[2014-03-26T09:39:34.319] sched: _slurm_rpc_job_step_create: StepId=175049.0 sand-6-29 usec=1070
[2014-03-26T10:18:12.517] requeue job 175049 due to failure of node sand-6-29
[2014-03-26T10:18:12.519] debug:  email msg to abc123: SLURM Job_id=175049 Name=LiGe Failed, Run time 00:38:41
[2014-03-26T10:18:31.010] backfill: Started JobId=175049 on sand-2-44
[2014-03-26T10:18:34.336] debug:  _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-2-44 usec=2
[2014-03-26T10:18:34.340] sched: _slurm_rpc_job_step_create: StepId=175049.1 sand-2-44 usec=1160
[2014-03-26T10:22:17.497] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 175049.0 nodes 0-0 rc=4294967294 uid=0
[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
[2014-03-26T10:22:26.423] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:23:05.834] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:06.524] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 174743.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:26.494] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:42.211] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 174712.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:05.925] debug:  Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:44.532] Performing RPC: REQUEST_SHUTDOWN
[2014-03-26T10:25:44.532] performing immeditate shutdown without state save
[2014-03-26T10:25:44.532] SIGABRT received
[2014-03-26T10:25:44.533] debug:  sched: slurmctld terminating

==> killall -9 slurmctld here followed by slurm restart

[2014-03-26T10:27:21.544] pidfile not locked, assuming no running daemon
[2014-03-26T10:27:21.545] debug:  sched: slurmctld starting
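
A per-job history like the one above could be pulled out of a full slurmctld.log with a small filter. The sketch below is a heuristic under stated assumptions: job_history is a hypothetical helper, it assumes one log entry per line, and it matches the job id as a whole number anywhere in the line, so an unrelated field that happens to equal the id (e.g. a usec value) would also match.

```python
import re


def job_history(lines, job_id):
    """Return the log lines that mention job_id as a whole number."""
    # Lookarounds stop 175049 from matching inside a longer number such as 1750491.
    pattern = re.compile(rf"(?<!\d){job_id}(?!\d)")
    return [line.rstrip("\n") for line in lines if pattern.search(line)]
```

For example, job_history(open("slurmctld.log"), 175049) would return the entries for that job in file order, covering the JobId=, Job_id=, StepId= and "job N" spellings seen in the excerpt.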


--
Dr. Stuart Rankin

Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
