[slurm-dev] Re: slurmctld lockups after job requeue

Danny Auble Wed, 26 Mar 2014 11:57:03 -0700


Does this happen with rc1?  Much has changed since pre5.


On 03/26/14 11:51, Stuart Rankin wrote:

Hi,

I am still using 14.03.0-0pre5 on Scientific Linux 6/RHEL6.
We have occasionally seen a nasty problem whereby slurmctld stops responding, and the number of threads just keeps increasing until it hits the limit. Despite our best efforts using all the usual tricks we have been unable to get a core dump in this situation (e.g. scontrol abort does nothing, similarly killall -ABRT; the only way to recover is to killall -9 slurmctld). We can get core dumps using these steps before the problem appears.
However the trigger appears to be a job which is requeued (e.g. due to apparent node failure), and restarted on another node, but with processes and an active slurmd remaining on the original node. The slurmd on the original node eventually contacts slurmctld, producing an error message like the following:
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
At this point, the problem occurs: threads begin to accumulate, but no client commands receive a response. On each occasion on which we have seen this issue, there has been a similar message in the slurmctld.log.
The history of the job above from the slurmctld.log is included below. To recover I restarted slurmd on the original node sand-6-29, did kill -9 slurmctld and restarted it.
Any thoughts or suggestions would be gratefully received.

Best regards

Stuart

[2014-03-26T09:39:31.013] backfill: Started JobId=175049 on sand-6-29
[2014-03-26T09:39:34.316] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-6-29 usec=2
[2014-03-26T09:39:34.318] debug:  Configuration for job 175049 complete
[2014-03-26T09:39:34.319] sched: _slurm_rpc_job_step_create: StepId=175049.0 sand-6-29 usec=1070
[2014-03-26T10:18:12.517] requeue job 175049 due to failure of node sand-6-29
[2014-03-26T10:18:12.519] debug: email msg to abc123: SLURM Job_id=175049 Name=LiGe Failed, Run time 00:38:41
[2014-03-26T10:18:31.010] backfill: Started JobId=175049 on sand-2-44
[2014-03-26T10:18:34.336] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-2-44 usec=2
[2014-03-26T10:18:34.340] sched: _slurm_rpc_job_step_create: StepId=175049.1 sand-2-44 usec=1160
[2014-03-26T10:22:17.497] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175049.0 nodes 0-0 rc=4294967294 uid=0
[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
[2014-03-26T10:22:26.423] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:23:05.834] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:06.524] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174743.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:26.494] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:42.211] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174712.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:05.925] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:44.532] Performing RPC: REQUEST_SHUTDOWN
[2014-03-26T10:25:44.532] performing immeditate shutdown without state save
[2014-03-26T10:25:44.532] SIGABRT received
[2014-03-26T10:25:44.533] debug:  sched: slurmctld terminating

==> killall -9 slurmctld here followed by slurm restart

[2014-03-26T10:27:21.544] pidfile not locked, assuming no running daemon
[2014-03-26T10:27:21.545] debug:  sched: slurmctld starting
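
For anyone wanting to check their own slurmctld.log for the same trigger, here is a minimal Python sketch that picks out the "sent from wrong node" errors and reports the job ID, the node that sent the completion, and the node slurmctld expected. The regular expression is an assumption based only on the single sample line quoted in this message, and the function name is made up for illustration; other Slurm versions may word the message differently.

```python
import re

# Pattern inferred from the one error line quoted in this thread (an
# assumption, not taken from the Slurm source). "sent\s*from" tolerates
# copies of the log where the line wrap swallowed the space.
WRONG_NODE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: Batch completion for job (?P<job>\d+) "
    r"sent\s*from wrong node "
    r"\((?P<sender>\S+) rather than (?P<expected>\S+)\)"
)


def find_wrong_node_completions(lines):
    """Return (timestamp, job_id, sending_node, expected_node) tuples."""
    hits = []
    for line in lines:
        m = WRONG_NODE.match(line.strip())
        if m:
            hits.append(
                (m.group("ts"), int(m.group("job")),
                 m.group("sender"), m.group("expected"))
            )
    return hits


# Two lines copied from the log excerpt in this message.
sample = [
    "[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid",
    "[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent "
    "from wrong node (sand-6-29 rather than sand-2-44), ignored request",
]

for hit in find_wrong_node_completions(sample):
    print(hit)
```

To run it against a real log, pass an open file object instead of the sample list, e.g. `find_wrong_node_completions(open(path))`, where `path` is wherever your slurmctld.log lives.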
