Your test works fine for me with the v2.3.2 code, and I can't recall
any recent changes in this part of the SLURM code. The slurmd log
below shows the results with KillWait=10 (the results with KillWait=300
were essentially the same). Since this seems to work properly on some
of your systems, there may be an issue with a Prolog, Epilog, SPANK
plugin, or ProctrackType plugin on the affected system.
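To narrow that down, it may help to compare the relevant settings on a
working and a failing system (standard scontrol usage; this is just a
suggested check, adjust to your installation):

```shell
# Show the effective settings on each system; a difference here is the
# first place to look when the same job behaves differently per cluster.
scontrol show config | egrep -i 'KillWait|Prolog|Epilog|ProctrackType'
```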
[2011-12-05T13:35:25] debug2: got this type of message 6009
[2011-12-05T13:35:25] debug2: Processing RPC: REQUEST_KILL_TIMELIMIT
[2011-12-05T13:35:25] [391216] auth plugin for Munge
(http://home.gna.org/munge/) loaded
[2011-12-05T13:35:25] debug2: container signal 996 to job 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_SIGNAL_CONTAINER
[2011-12-05T13:35:25] [391216] _handle_signal_container for job
391216.4294967294
[2011-12-05T13:35:25] [391216] *** JOB 391216 CANCELLED AT
2011-12-05T13:35:25 DUE TO TIME LIMIT ***
[2011-12-05T13:35:25] debug2: No steps in jobid 391216 to send signal 15
[2011-12-05T13:35:25] Job 391216: timeout: sent SIGTERM to 0 active steps
[2011-12-05T13:35:25] debug: _rpc_terminate_job, uid = 1001
[2011-12-05T13:35:25] debug: task_slurmd_release_resources: 391216
[2011-12-05T13:35:25] debug: credential for job 391216 revoked
[2011-12-05T13:35:25] debug2: container signal 18 to job 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_SIGNAL_CONTAINER
[2011-12-05T13:35:25] [391216] _handle_signal_container for job
391216.4294967294
[2011-12-05T13:35:25] [391216] Sent signal 18 to 391216.4294967294
[2011-12-05T13:35:25] debug2: container signal 15 to job 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_SIGNAL_CONTAINER
[2011-12-05T13:35:25] [391216] _handle_signal_container for job
391216.4294967294
[2011-12-05T13:35:25] [391216] Sent signal 15 to 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:26] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:27] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:28] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:29] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:30] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:31] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:32] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:33] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:34] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:35] debug2: terminate job step 391216.4294967294
[2011-12-05T13:35:35] [391216] Handling REQUEST_STEP_TERMINATE
[2011-12-05T13:35:35] [391216] _handle_terminate for job 391216.4294967294
[2011-12-05T13:35:35] [391216] Sent SIGKILL signal to 391216.4294967294
[2011-12-05T13:35:35] [391216] task 0 (19174) exited. Killed by signal 9.
[2011-12-05T13:35:35] [391216] task_post_term: 391216.4294967294, task 0
[2011-12-05T13:35:35] [391216] Aggregated 1 task exit messages
[2011-12-05T13:35:35] [391216] Before call to spank_fini()
[2011-12-05T13:35:35] [391216] After call to spank_fini()
[2011-12-05T13:35:35] [391216] false, shutdown
[2011-12-05T13:35:35] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:35] [391216] Message thread exited
[2011-12-05T13:35:35] [391216] job 391216 completed with slurm_rc = 0,
job_rc = 9
[2011-12-05T13:35:35] [391216] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2011-12-05T13:35:35] [391216] done with job
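The log shows the expected sequence: SIGCONT (18) and SIGTERM (15) at the
time limit, then SIGKILL once KillWait expires. The SIGKILL step is what
matters for the reproducer below, since SIGKILL cannot be trapped. A
minimal local sketch of that behavior, with no SLURM involved (hypothetical
stand-alone script):

```shell
#!/bin/bash
# A shell that traps SIGTERM, like the reproducer's job script,
# survives SIGTERM but not SIGKILL.
bash -c 'trap "while true; do sleep 1; done" TERM; sleep 30 & wait' &
pid=$!
sleep 1                               # let the trap get installed

kill -TERM "$pid"                     # analogous to slurmd's signal 15
sleep 2
alive_after_term=no
kill -0 "$pid" 2>/dev/null && alive_after_term=yes
echo "alive after SIGTERM: $alive_after_term"   # yes: the trap loops forever

kill -KILL "$pid"                     # analogous to SIGKILL after KillWait
wait "$pid" 2>/dev/null
alive_after_kill=no
kill -0 "$pid" 2>/dev/null && alive_after_kill=yes
echo "alive after SIGKILL: $alive_after_kill"   # no: SIGKILL cannot be trapped
```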
Quoting Per Lundqvist <[email protected]>:
Hi,
We have observed that jobs trapping SIGTERM without exiting the job
script don't get killed with SIGKILL after 5m (KillWait=300) and
instead get stuck in COMPLETING.
Reproduced by:
#!/bin/bash
#SBATCH -t 00:01:00
trap "while true; do sleep 1; done" TERM
sleep 365d
We are running slurm 2.2.7. It is, however, not reproducible on one of
our other systems, which is running 2.2.4.
thanks,
--
Per Lundqvist
National Supercomputer Centre
Linköping University, Sweden
http://www.nsc.liu.se