Your test works fine for me with the v2.3.2 code, and I can't recall any recent changes in this part of the SLURM code. The slurmd log below was produced with KillWait=10 (results with KillWait=300 were essentially the same): SIGTERM is sent at 13:35:25 and SIGKILL follows ten seconds later, at 13:35:35. Since this seems to work properly on some of your systems, there may be an issue with a Prolog, Epilog, SPANK plugin, or ProctrackType plugin.

[2011-12-05T13:35:25] debug2: got this type of message 6009
[2011-12-05T13:35:25] debug2: Processing RPC: REQUEST_KILL_TIMELIMIT
[2011-12-05T13:35:25] [391216] auth plugin for Munge (http://home.gna.org/munge/) loaded
[2011-12-05T13:35:25] debug2: container signal 996 to job 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_SIGNAL_CONTAINER
[2011-12-05T13:35:25] [391216] _handle_signal_container for job 391216.4294967294
[2011-12-05T13:35:25] [391216] *** JOB 391216 CANCELLED AT 2011-12-05T13:35:25 DUE TO TIME LIMIT ***
[2011-12-05T13:35:25] debug2: No steps in jobid 391216 to send signal 15
[2011-12-05T13:35:25] Job 391216: timeout: sent SIGTERM to 0 active steps
[2011-12-05T13:35:25] debug:  _rpc_terminate_job, uid = 1001
[2011-12-05T13:35:25] debug:  task_slurmd_release_resources: 391216
[2011-12-05T13:35:25] debug:  credential for job 391216 revoked
[2011-12-05T13:35:25] debug2: container signal 18 to job 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_SIGNAL_CONTAINER
[2011-12-05T13:35:25] [391216] _handle_signal_container for job 391216.4294967294
[2011-12-05T13:35:25] [391216] Sent signal 18 to 391216.4294967294
[2011-12-05T13:35:25] debug2: container signal 15 to job 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_SIGNAL_CONTAINER
[2011-12-05T13:35:25] [391216] _handle_signal_container for job 391216.4294967294
[2011-12-05T13:35:25] [391216] Sent signal 15 to 391216.4294967294
[2011-12-05T13:35:25] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:26] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:27] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:28] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:29] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:30] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:31] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:32] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:33] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:34] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:35] debug2: terminate job step 391216.4294967294
[2011-12-05T13:35:35] [391216] Handling REQUEST_STEP_TERMINATE
[2011-12-05T13:35:35] [391216] _handle_terminate for job 391216.4294967294
[2011-12-05T13:35:35] [391216] Sent SIGKILL signal to 391216.4294967294
[2011-12-05T13:35:35] [391216] task 0 (19174) exited. Killed by signal 9.
[2011-12-05T13:35:35] [391216] task_post_term: 391216.4294967294, task 0
[2011-12-05T13:35:35] [391216] Aggregated 1 task exit messages
[2011-12-05T13:35:35] [391216] Before call to spank_fini()
[2011-12-05T13:35:35] [391216] After call to spank_fini()
[2011-12-05T13:35:35] [391216]   false, shutdown
[2011-12-05T13:35:35] [391216] Handling REQUEST_STATE
[2011-12-05T13:35:35] [391216] Message thread exited
[2011-12-05T13:35:35] [391216] job 391216 completed with slurm_rc = 0, job_rc = 9
[2011-12-05T13:35:35] [391216] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2011-12-05T13:35:35] [391216] done with job
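
For comparison outside SLURM, the sequence visible in the log above (signal 18, then signal 15, a KillWait-long pause, then SIGKILL) can be imitated with a few lines of bash. This is only a rough sketch, not SLURM code: the helper name kill_with_grace is made up for illustration, it signals a single PID rather than a whole job container, and it assumes Linux signal numbering (18 = SIGCONT, 15 = SIGTERM). The child process is a stand-in for the job script from the original report.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): send SIGCONT and SIGTERM, wait up
# to kill_wait seconds, then escalate to SIGKILL if the process is still there.
kill_with_grace() {
    local pid=$1 kill_wait=$2 waited=0
    kill -CONT "$pid" 2>/dev/null
    kill -TERM "$pid" 2>/dev/null
    # Check once per second; the log shows one REQUEST_STATE per second
    # during the same window.
    while kill -0 "$pid" 2>/dev/null && [ "$waited" -lt "$kill_wait" ]; do
        sleep 1
        waited=$((waited + 1))
    done
    # Still present after the grace period: send SIGKILL.
    if kill -0 "$pid" 2>/dev/null; then
        kill -KILL "$pid"
    fi
}

# Stand-in for the job script: trap SIGTERM and never exit.
bash -c 'trap "while true; do sleep 1; done" TERM; while true; do sleep 1; done' &
job_pid=$!
sleep 1                        # give the child time to install its trap
kill_with_grace "$job_pid" 2   # 2 plays the role of KillWait
wait "$job_pid"
rc=$?
echo "exit status: $rc"
```

If the child survives SIGTERM as intended, bash should report an exit status of 137 (128 + 9), which corresponds to the job_rc = 9 line in the log. If a run like this behaves as expected on a node where the batch job still hangs in COMPLETING, that would point further toward the Prolog, Epilog, SPANK or ProctrackType configuration rather than the signal handling itself.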

Quoting Per Lundqvist:

Hi,

We have observed that jobs that trap SIGTERM without exiting the job
script don't get killed with SIGKILL after 5 minutes (KillWait=300) and
instead get stuck in COMPLETING.

Reproduced by:

   #!/bin/bash
   #SBATCH -t 00:01:00
   trap "while true; do sleep 1; done" TERM
   sleep 365d

We are running SLURM 2.2.7. It is, however, not reproducible on one of
our other systems, which is running 2.2.4.

thanks,

--
Per Lundqvist

National Supercomputer Centre
Linköping University, Sweden

http://www.nsc.liu.se


