Hi,
I am still using Slurm 14.03.0-0pre5 on Scientific Linux 6/RHEL6.
We have occasionally seen a nasty problem whereby slurmctld stops responding and its thread count
keeps increasing until it hits the limit. Despite trying all the usual tricks, we have been unable
to get a core dump in this state: scontrol abort does nothing, nor does killall -ABRT, and the only
way to recover is killall -9 slurmctld. The same steps do produce core dumps before the problem
appears.
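For anyone wanting to watch for the thread build-up described above, here is a minimal sketch that reads the thread count of a process from /proc/<pid>/status on Linux. The function names are hypothetical, and how the slurmctld pid is obtained (pidfile, pidof, etc.) is left as an assumption for the local site.

```python
def parse_thread_count(status_text):
    """Return the value of the 'Threads:' field from /proc/<pid>/status text.

    Returns None if the field is not present.
    """
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    return None


def read_thread_count(pid):
    """Read the current thread count of a running process (Linux only)."""
    with open("/proc/%d/status" % pid) as f:
        return parse_thread_count(f.read())


# Example use (the pid is an assumption; take it from your own slurmctld pidfile):
#   count = read_thread_count(12345)
```

Polling this from cron and alerting when the count climbs well above its normal value might give a chance to take a core dump while the daemon still reacts to signals, though we have not verified that this is the case.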
However, the trigger appears to be a job which is requeued (e.g. due to apparent node failure) and
restarted on another node while its processes and an active slurmd remain on the original node.
The slurmd on the original node eventually contacts slurmctld, producing an error message like the
following:
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
At this point the problem occurs: threads begin to accumulate and no client commands receive a
response. On every occasion on which we have seen this issue, there has been a similar message in
slurmctld.log.
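To check whether other hangs line up with the same message, a short script along these lines could pull every such event out of slurmctld.log. It is only a sketch: the regular expression is modelled on the single message quoted above, so the wording in other Slurm versions may differ, and the function name is hypothetical.

```python
import re

# Modelled on the one error line quoted above; other versions may word it differently.
WRONG_NODE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: Batch completion for job (?P<job>\d+) "
    r"sent from wrong node \((?P<sender>\S+)\s+rather than (?P<expected>[^)\s]+)\)"
)


def find_wrong_node_events(lines):
    """Return (timestamp, job id, sending node, expected node) for each match."""
    events = []
    for line in lines:
        m = WRONG_NODE.match(line)
        if m:
            events.append(
                (m.group("ts"), int(m.group("job")),
                 m.group("sender"), m.group("expected"))
            )
    return events


# Example use (the log path is an assumption; use your own SlurmctldLogFile):
#   with open("/var/log/slurm/slurmctld.log") as f:
#       for event in find_wrong_node_events(f):
#           print(event)
```

The timestamps returned can then be compared with the times at which slurmctld was observed to stop responding.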
The history of the job above from slurmctld.log is included below. To recover, I restarted slurmd
on the original node sand-6-29, ran killall -9 slurmctld and restarted it.
Any thoughts or suggestions would be gratefully received.
Best regards
Stuart
[2014-03-26T09:39:31.013] backfill: Started JobId=175049 on sand-6-29
[2014-03-26T09:39:34.316] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-6-29 usec=2
[2014-03-26T09:39:34.318] debug: Configuration for job 175049 complete
[2014-03-26T09:39:34.319] sched: _slurm_rpc_job_step_create: StepId=175049.0 sand-6-29 usec=1070
[2014-03-26T10:18:12.517] requeue job 175049 due to failure of node sand-6-29
[2014-03-26T10:18:12.519] debug: email msg to abc123: SLURM Job_id=175049 Name=LiGe Failed, Run time 00:38:41
[2014-03-26T10:18:31.010] backfill: Started JobId=175049 on sand-2-44
[2014-03-26T10:18:34.336] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-2-44 usec=2
[2014-03-26T10:18:34.340] sched: _slurm_rpc_job_step_create: StepId=175049.1 sand-2-44 usec=1160
[2014-03-26T10:22:17.497] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175049.0 nodes 0-0 rc=4294967294 uid=0
[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
[2014-03-26T10:22:26.423] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:23:05.834] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:06.524] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174743.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:26.494] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:42.211] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174712.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:05.925] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:44.532] Performing RPC: REQUEST_SHUTDOWN
[2014-03-26T10:25:44.532] performing immeditate shutdown without state save
[2014-03-26T10:25:44.532] SIGABRT received
[2014-03-26T10:25:44.533] debug: sched: slurmctld terminating
==> killall -9 slurmctld here followed by slurm restart
[2014-03-26T10:27:21.544] pidfile not locked, assuming no running daemon
[2014-03-26T10:27:21.545] debug: sched: slurmctld starting
--
Dr. Stuart Rankin
Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517