Does this happen with rc1? Much has changed since pre5.
On 03/26/14 11:51, Stuart Rankin wrote:
Hi,
I am still using 14.03.0-0pre5 on Scientific Linux 6/RHEL6.
We have occasionally seen a nasty problem whereby slurmctld stops
responding and its thread count keeps increasing until it hits the
limit. Despite trying all the usual tricks, we have been unable to get
a core dump in this situation: scontrol abort does nothing, nor does
killall -ABRT, and the only way to recover is killall -9 slurmctld.
The same steps do produce core dumps before the problem appears.
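The thread growth described above can be watched from outside the
daemon. The following is a minimal sketch, assuming a Linux host where
/proc/<pid>/status exposes a "Threads:" field; the function names are
made up for illustration and the pid has to be supplied by the caller.
It only reports the count and does not explain why slurmctld hangs.

```python
import re
from pathlib import Path


def parse_thread_count(status_text):
    """Return the integer in the 'Threads:' line of /proc/<pid>/status text."""
    match = re.search(r"^Threads:\s+(\d+)$", status_text, re.MULTILINE)
    if match is None:
        raise ValueError("no 'Threads:' field found in status text")
    return int(match.group(1))


def thread_count(pid):
    """Read the current thread count of a process (Linux /proc only)."""
    return parse_thread_count(Path(f"/proc/{pid}/status").read_text())
```

Calling thread_count() periodically with the slurmctld pid and logging
the result with a timestamp would show when the count starts to climb
relative to the messages in slurmctld.log.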
However, the trigger appears to be a job that is requeued (e.g. due to
apparent node failure) and restarted on another node while processes
and an active slurmd remain on the original node. The slurmd on the
original node eventually contacts slurmctld, producing an error
message like the following:
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
At this point the problem occurs: threads begin to accumulate and no
client commands receive a response. On every occasion on which we have
seen this issue, there has been a similar message in slurmctld.log.
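Since every occurrence has been preceded by that message, the log can
be scanned for it. The following is a minimal sketch: the pattern is
copied from the wording of the error quoted above, not from the Slurm
source, and the function name is made up for illustration. Whitespace
is collapsed first so that entries wrapped across lines still match.

```python
import re

# Wording taken from the error message quoted in this mail (an assumption
# about the general format, based on this one example).
WRONG_NODE = re.compile(
    r"\[(?P<ts>[^\]]+)\] error: Batch completion for job (?P<job>\d+) "
    r"sent from wrong node \((?P<sender>\S+) rather than (?P<expected>[^)\s]+)\)"
)


def find_wrong_node_events(log_text):
    """Return one dict per 'sent from wrong node' error found in log_text."""
    # Collapse newlines and runs of spaces so wrapped entries become one line.
    flat = " ".join(log_text.split())
    return [m.groupdict() for m in WRONG_NODE.finditer(flat)]
```

Each result carries the timestamp, the job id, the node that sent the
completion and the node slurmctld expected, which is enough to see
which slurmd should be checked for leftover processes.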
The history of the job above, taken from slurmctld.log, is included
below. To recover, I restarted slurmd on the original node (sand-6-29),
then ran killall -9 slurmctld and restarted it.
Any thoughts or suggestions would be gratefully received.
Best regards
Stuart
[2014-03-26T09:39:31.013] backfill: Started JobId=175049 on sand-6-29
[2014-03-26T09:39:34.316] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-6-29 usec=2
[2014-03-26T09:39:34.318] debug: Configuration for job 175049 complete
[2014-03-26T09:39:34.319] sched: _slurm_rpc_job_step_create: StepId=175049.0 sand-6-29 usec=1070
[2014-03-26T10:18:12.517] requeue job 175049 due to failure of node sand-6-29
[2014-03-26T10:18:12.519] debug: email msg to abc123: SLURM Job_id=175049 Name=LiGe Failed, Run time 00:38:41
[2014-03-26T10:18:31.010] backfill: Started JobId=175049 on sand-2-44
[2014-03-26T10:18:34.336] debug: _slurm_rpc_job_alloc_info_lite JobId=175049 NodeList=sand-2-44 usec=2
[2014-03-26T10:18:34.340] sched: _slurm_rpc_job_step_create: StepId=175049.1 sand-2-44 usec=1160
[2014-03-26T10:22:17.497] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175049.0 nodes 0-0 rc=4294967294 uid=0
[2014-03-26T10:22:17.497] step_partial_comp: StepID=175049.0 invalid
[2014-03-26T10:22:18.464] error: Batch completion for job 175049 sent from wrong node (sand-6-29 rather than sand-2-44), ignored request
[2014-03-26T10:22:26.423] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:23:05.834] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:06.524] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174743.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:26.494] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175096.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:24:42.211] debug: Processing RPC: REQUEST_STEP_COMPLETE for 174712.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:05.925] debug: Processing RPC: REQUEST_STEP_COMPLETE for 175044.0 nodes 0-0 rc=0 uid=0
[2014-03-26T10:25:44.532] Performing RPC: REQUEST_SHUTDOWN
[2014-03-26T10:25:44.532] performing immeditate shutdown without state save
[2014-03-26T10:25:44.532] SIGABRT received
[2014-03-26T10:25:44.533] debug: sched: slurmctld terminating
==> killall -9 slurmctld here followed by slurm restart
[2014-03-26T10:27:21.544] pidfile not locked, assuming no running daemon
[2014-03-26T10:27:21.545] debug: sched: slurmctld starting
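
The start/requeue/restart sequence for a single job can be pulled out
of a log like the excerpt above. The following is a minimal sketch: the
two patterns are based only on the "Started JobId=" and "requeue job"
lines shown here, the function name is made up for illustration, and
other Slurm versions may word these messages differently.

```python
import re

# Zero-width split point in front of each "[YYYY-MM-DDTHH:MM:SS.mmm]" stamp.
# Splitting on an empty match needs Python 3.7 or later.
ENTRY_START = re.compile(r"(?=\[\d{4}-\d{2}-\d{2}T[\d:.]+\])")
STARTED = re.compile(r"\[([^\]]+)\] \w+: Started JobId=(\d+) on (\S+)")
REQUEUED = re.compile(r"\[([^\]]+)\] requeue job (\d+) due to failure of node (\S+)")


def job_timeline(log_text, job_id):
    """Return (timestamp, event, node) tuples for one job, in log order."""
    # Collapse whitespace so entries wrapped over several lines are rejoined.
    flat = " ".join(log_text.split())
    events = []
    for entry in ENTRY_START.split(flat):
        for kind, pattern in (("started", STARTED), ("requeued", REQUEUED)):
            m = pattern.match(entry)
            if m and m.group(2) == str(job_id):
                events.append((m.group(1), kind, m.group(3)))
    return events
```

For job 175049 this would give a start on sand-6-29, a requeue caused by
sand-6-29, and a second start on sand-2-44, which is the situation in
which the original node later reports a batch completion.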