[slurm-dev] slurmctld lockups after job requeue

Stuart Rankin Wed, 26 Mar 2014 11:51:56 -0700

Hi,


I am still using 14.03.0-0pre5 on Scientific Linux 6/RHEL6.

We have occasionally seen a nasty problem whereby slurmctld stops responding, and the number of threads just keeps increasing until it hits the limit. Despite our best efforts using all the usual tricks, we have been unable to get a core dump in this situation (e.g. scontrol abort does nothing, similarly killall -ABRT; the only way to recover is to killall -9 slurmctld). We can get core dumps using these steps before the problem appears.
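
For anyone wanting to catch the thread build-up early, the following is a minimal sketch, assuming a Linux host where /proc/<pid>/status exposes a "Threads:" field. The function names (parse_thread_count, thread_count, is_growing) are hypothetical helpers for illustration and are not part of Slurm.

```python
import re
from pathlib import Path


def parse_thread_count(status_text):
    """Return the 'Threads:' value from /proc/<pid>/status text, or None if absent."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, re.MULTILINE)
    return int(match.group(1)) if match else None


def thread_count(pid):
    """Read the current thread count of a process (Linux /proc only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


def is_growing(samples, min_samples=3):
    """True if the last min_samples readings are strictly increasing."""
    recent = samples[-min_samples:]
    return len(recent) == min_samples and all(
        a < b for a, b in zip(recent, recent[1:])
    )
```

Sampling thread_count(pid) for the slurmctld process at a fixed interval and passing the readings to is_growing would flag a steadily rising count; the interval and the min_samples threshold are arbitrary choices to tune locally.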

However the trigger appears to be a job which is requeued (e.g. due to apparent node failure), and restarted on another node, but with processes and an active slurmd remaining on the original node. The slurmd on the original node eventually contacts slurmctld producing an error message like the following:

[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request

At this point, the problem occurs - threads begin to accumulate, but no client commands receive a response. On each occasion on which we have seen this issue, there has been a similar message in the slurmctld.log.
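
Since the same message has preceded every lockup, scanning slurmctld.log for it is one way to correlate occurrences. This is a minimal sketch: the pattern is derived only from the single sample line quoted in this post and may need adjusting for other Slurm versions, and the names WRONG_NODE_RE and find_wrong_node_completions are hypothetical.

```python
import re

# Pattern built from the one sample error line in this post (assumption:
# other Slurm versions may word or lay out the message differently).
WRONG_NODE_RE = re.compile(
    r"\[(?P<ts>[^\]]+)\] error: Batch completion for job (?P<job>\d+) "
    r"sent from wrong node \((?P<got>\S+) rather than (?P<expected>[^)\s]+)\)"
)


def find_wrong_node_completions(lines):
    """Return (timestamp, job_id, sending_node, expected_node) for each match."""
    hits = []
    for line in lines:
        m = WRONG_NODE_RE.search(line)
        if m:
            hits.append(
                (m.group("ts"), int(m.group("job")), m.group("got"), m.group("expected"))
            )
    return hits
```

Passing an open log file object as lines works, since iterating a file yields one line at a time; the sending node in each hit is the one where a stale slurmd is likely still running.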

The history of the job above from the slurmctld.log is included below. To recover I restarted slurmd on the original node sand-6-29, did kill -9 slurmctld and restarted it.


Any thoughts or suggestions would be gratefully received.

Best regards

Stuart

[2014-03-26T09:39:31.013] backfill: Started JobId=175049 on sand-6-29
[2014-03-26T09:39:34.316] debug:  _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-6-29 usec=2
[2014-03-26T09:39:34.318] debug:  Configuration for job 175049 complete
[2014-03-26T09:39:34.319] sched: _slurm_rpc_job_step_create: StepId=175049.0 sand-6-29 usec=1070
[2014-03-26T10:18:12.517] requeue job 175049 due to failure of node sand-6-29

[2014-03-26T10:18:12.519] debug: email msg to abc123: SLURM Job_id=175049 Name=LiGe Failed, Runtime 00:38:41

[2014-03-26T10:18:31.010] backfill: Started JobId=175049 on sand-2-44
[2014-03-26T10:18:34.336] debug:  _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-2-44 usec=2
[2014-03-26T10:18:34.340] sched: _slurm_rpc_job_step_create: StepId=175049.1 sand-2-44 usec=1160

[2014-03-26T10:22:17.497] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175049.0 nodes 0-0 rc=4294967294 uid=0

[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid

[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
[2014-03-26T10:22:26.423] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:23:05.834] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:06.524] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174743.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:26.494] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:42.211] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174712.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:05.925] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0

[2014-03-26T10:25:44.532] Performing RPC: REQUEST_SHUTDOWN
[2014-03-26T10:25:44.532] performing immeditate shutdown without state save
[2014-03-26T10:25:44.532] SIGABRT received
[2014-03-26T10:25:44.533] debug:  sched: slurmctld terminating

==> killall -9 slurmctld here followed by slurm restart

[2014-03-26T10:27:21.544] pidfile not locked, assuming no running daemon
[2014-03-26T10:27:21.545] debug:  sched: slurmctld starting



--
Dr. Stuart Rankin

Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517
