Hi Danny
Unknown at the moment. Updating to rc1 will be part of the next planned maintenance. Right now we
are seeing this problem about twice a month with pre5 on 700 nodes.
Regards
Stuart
On 26/03/14 18:55, Danny Auble wrote:
Does this happen with rc1? Much has changed since pre5.
On 03/26/14 11:51, Stuart Rankin wrote:
Hi,
I am still using 14.03.0-0pre5 on Scientific Linux 6/RHEL6.
We have occasionally seen a nasty problem whereby slurmctld stops responding,
and the number of threads just keeps increasing until it hits the limit.
Despite our best efforts using all the usual tricks, we have been unable to get
a core dump in this situation (e.g. scontrol abort does nothing, nor does
killall -ABRT; the only way to recover is killall -9 slurmctld). We can get
core dumps using these steps before the problem appears.
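
Since core dumps can still be obtained before the hang, one possible approach is
to watch the daemon's thread count and act while it still responds. The sketch
below reads the Threads: field that Linux exposes in /proc/<pid>/status. The
function names and the threshold are hypothetical, and whether the count rises
early enough to be useful here is untested.

```python
import re
from pathlib import Path


def parse_thread_count(status_text: str) -> int:
    """Return the 'Threads:' value from the text of /proc/<pid>/status."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, flags=re.MULTILINE)
    if match is None:
        raise ValueError("no Threads: field found in status text")
    return int(match.group(1))


def read_thread_count(pid: int) -> int:
    """Read the current thread count of a running process (Linux only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())


def exceeds_threshold(count: int, threshold: int) -> bool:
    """True once the thread count reaches the (hypothetical) alert threshold."""
    return count >= threshold
```

A periodic job could call read_thread_count() on the slurmctld PID and, once
exceeds_threshold() returns True, request the core dump by the usual means.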
However, the trigger appears to be a job which is requeued (e.g. due to
apparent node failure) and restarted on another node, but with processes and an
active slurmd remaining on the original node. The slurmd on the original node
eventually contacts slurmctld, producing an error message like the following:
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
At this point the problem occurs: threads begin to accumulate, but no client
commands receive a response. On each occasion on which we have seen this issue,
there has been a similar message in the slurmctld.log.
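
To check how often this message has appeared, the log can be scanned for it. The
sketch below is a minimal example, assuming each log entry occupies a single
line in the actual file and has the same shape as the message quoted above. The
names WrongNodeEvent and find_wrong_node_events are hypothetical.

```python
import re
from typing import Iterable, Iterator, NamedTuple


class WrongNodeEvent(NamedTuple):
    timestamp: str
    job_id: int
    sending_node: str   # node that sent the batch completion
    expected_node: str  # node slurmctld currently associates with the job


# Pattern modelled only on the single message quoted in this thread.
WRONG_NODE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\] error: Batch completion for job (?P<job>\d+) "
    r"sent from wrong node \((?P<sent>\S+) rather than (?P<expected>[^)\s]+)\)"
)


def find_wrong_node_events(lines: Iterable[str]) -> Iterator[WrongNodeEvent]:
    """Yield one event per 'sent from wrong node' error line."""
    for line in lines:
        m = WRONG_NODE_RE.match(line)
        if m:
            yield WrongNodeEvent(
                m.group("ts"), int(m.group("job")),
                m.group("sent"), m.group("expected"),
            )
```

Passing an open file handle for slurmctld.log to find_wrong_node_events() gives
the timestamps, which can then be compared with the times of the hangs.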
The history of the job above from the slurmctld.log is included below. To
recover, I restarted slurmd on the original node sand-6-29, did killall -9
slurmctld and restarted it.
Any thoughts or suggestions would be gratefully received.
Best regards
Stuart
[2014-03-26T09:39:31.013] backfill: Started JobId=175049 on sand-6-29
[2014-03-26T09:39:34.316] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-6-29 usec=2
[2014-03-26T09:39:34.318] debug: Configuration for job 175049 complete
[2014-03-26T09:39:34.319] sched: _slurm_rpc_job_step_create: StepId=175049.0 sand-6-29 usec=1070
[2014-03-26T10:18:12.517] requeue job 175049 due to failure of node sand-6-29
[2014-03-26T10:18:12.519] debug: email msg to abc123: SLURM Job_id=175049 Name=LiGe Failed, Run time 00:38:41
[2014-03-26T10:18:31.010] backfill: Started JobId=175049 on sand-2-44
[2014-03-26T10:18:34.336] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-2-44 usec=2
[2014-03-26T10:18:34.340] sched: _slurm_rpc_job_step_create: StepId=175049.1 sand-2-44 usec=1160
[2014-03-26T10:22:17.497] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175049.0 nodes 0-0 rc=4294967294 uid=0
[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
[2014-03-26T10:22:26.423] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:23:05.834] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:06.524] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174743.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:26.494] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:42.211] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174712.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:05.925] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:44.532] Performing RPC: REQUEST_SHUTDOWN
[2014-03-26T10:25:44.532] performing immeditate shutdown without state save
[2014-03-26T10:25:44.532] SIGABRT received
[2014-03-26T10:25:44.533] debug: sched: slurmctld terminating
==> killall -9 slurmctld here followed by slurm restart
[2014-03-26T10:27:21.544] pidfile not locked, assuming no running daemon
[2014-03-26T10:27:21.545] debug: sched: slurmctld starting
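
Pulling one job's history out of slurmctld.log, as in the excerpt above, can be
sketched as follows. This is a minimal example: the function name job_history is
hypothetical, and the prefixes it looks for (JobId=, Job_id=, StepId=, "job ",
"for ") are only the forms visible in this excerpt, so other message formats
would be missed.

```python
import re
from typing import Iterable, List


def job_history(lines: Iterable[str], job_id: int) -> List[str]:
    """Return the log lines that mention the given job or one of its steps."""
    # Prefixes are taken from the excerpt in this thread (an assumption about
    # the log format); (?!\d) stops 175049 from matching inside 1750490.
    pattern = re.compile(
        rf"(?:jobid=|job_id=|stepid=|job |for ){job_id}(?!\d)",
        re.IGNORECASE,
    )
    return [line.rstrip("\n") for line in lines if pattern.search(line)]


if __name__ == "__main__":
    sample = [
        "[2014-03-26T10:18:12.517] requeue job 175049 due to failure of node sand-6-29",
        "[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid",
        "[2014-03-26T10:22:26.423] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0",
    ]
    for entry in job_history(sample, 175049):
        print(entry)
```

Run against the full log, this would also show whether the requeued job's
second step (175049.1) was still active when the stale completion arrived.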
--
Dr. Stuart Rankin
Senior System Administrator
High Performance Computing Service
University of Cambridge
Email: [email protected]
Tel: (+)44 1223 763517