On Fri, 2 Dec 2011, Mark A. Grondona wrote:

> On Fri, 2 Dec 2011 11:05:51 +0100 (CET), Per Lundqvist <[email protected]> 
> wrote:
> > Hi,
> > 
> > We have observed that jobs trapping SIGTERM without exiting the job
> > script, doesn't get killed with SIGKILL after 5m (KillWait=300) and
> > instead gets stuck in COMPLETING.
> 
> That is interesting. There is no way to trap SIGKILL so I wonder if
> SLURM is not ever sending the SIGKILL?

Yes, I think so too, here's the verbose slurmd log from around the
time when it's supposed to send the kill signal (in this case it is
expected to be killed 5 minutes after 09:32:45, == 09:37:45):

   [2011-12-02T09:36:54] debug4: found jobid = 323254, stepid = 4294967294
   [2011-12-02T09:36:54] [323254] Called _msg_socket_accept
   [2011-12-02T09:36:54] [323254] Leaving _msg_socket_accept
   [2011-12-02T09:36:54] [323254] eio: handling events for 1 objects
   [2011-12-02T09:36:54] [323254] Called _msg_socket_readable
   [2011-12-02T09:36:54] [323254] Entering _handle_accept (new thread)
   [2011-12-02T09:36:54] [323254]   Identity: uid=0, gid=0
   [2011-12-02T09:36:54] [323254] Entering _handle_request
   [2011-12-02T09:36:54] [323254] Got request 5
   [2011-12-02T09:36:54] [323254] Handling REQUEST_STATE
   [2011-12-02T09:36:54] [323254] Leaving  _handle_request: SLURM_SUCCESS
   [2011-12-02T09:36:54] [323254] Entering _handle_request
   [2011-12-02T09:36:54] [323254] Leaving  _handle_accept
   [2011-12-02T09:37:54] debug4: found jobid = 323254, stepid = 4294967294
   [2011-12-02T09:37:54] [323254] Called _msg_socket_accept
   [2011-12-02T09:37:54] [323254] Leaving _msg_socket_accept
   [2011-12-02T09:37:54] [323254] eio: handling events for 1 objects
   [2011-12-02T09:37:54] [323254] Called _msg_socket_readable
   [2011-12-02T09:37:54] [323254] Entering _handle_accept (new thread)
   [2011-12-02T09:37:54] [323254]   Identity: uid=0, gid=0
   [2011-12-02T09:37:54] [323254] Entering _handle_request
   [2011-12-02T09:37:54] [323254] Got request 5
   [2011-12-02T09:37:54] [323254] Handling REQUEST_STATE
   [2011-12-02T09:37:54] [323254] Leaving  _handle_request: SLURM_SUCCESS
   [2011-12-02T09:37:54] [323254] Entering _handle_request
   [2011-12-02T09:37:54] [323254] Leaving  _handle_accept
   [2011-12-02T09:38:29] debug3: in the service_connection
   [2011-12-02T09:38:29] debug2: got this type of message 6011
   [2011-12-02T09:38:29] debug2: Processing RPC: REQUEST_TERMINATE_JOB
   [2011-12-02T09:38:29] debug:  _rpc_terminate_job, uid = 103
   [2011-12-02T09:38:29] debug:  task_slurmd_release_resources: 323254
   [2011-12-02T09:38:54] debug4: found jobid = 323254, stepid = 4294967294
   
   [2011-12-02T09:38:54] [323254] Called _msg_socket_accept
   [2011-12-02T09:38:54] [323254] Leaving _msg_socket_accept
   [2011-12-02T09:38:54] [323254] eio: handling events for 1 objects
   [2011-12-02T09:38:54] [323254] Called _msg_socket_readable
   [2011-12-02T09:38:54] [323254] Entering _handle_accept (new thread)
   [2011-12-02T09:38:54] [323254]   Identity: uid=0, gid=0
   [2011-12-02T09:38:54] [323254] Entering _handle_request
   [2011-12-02T09:38:54] [323254] Got request 5
   [2011-12-02T09:38:54] [323254] Handling REQUEST_STATE
   [2011-12-02T09:38:54] [323254] Leaving  _handle_request: SLURM_SUCCESS
   [2011-12-02T09:38:54] [323254] Entering _handle_request
   [2011-12-02T09:38:54] [323254] Leaving  _handle_accept
   
And here is the corresponding log when running on our 2.2.4 system,
which succeeds:

   [2011-12-02T09:03:32] debug4: found jobid = 1717364, stepid = 4294967294
   [2011-12-02T09:03:32] [1717364] Called _msg_socket_accept
   [2011-12-02T09:03:32] [1717364] Leaving _msg_socket_accept
   [2011-12-02T09:03:32] [1717364] eio: handling events for 1 objects
   [2011-12-02T09:03:32] [1717364] Called _msg_socket_readable
   [2011-12-02T09:03:32] [1717364] Entering _handle_accept (new thread)
   [2011-12-02T09:03:32] [1717364]   Identity: uid=0, gid=0
   [2011-12-02T09:03:32] [1717364] Entering _handle_request
   [2011-12-02T09:03:32] debug2: terminate job step 1717364.4294967294
   [2011-12-02T09:03:32] [1717364] Got request 12
   [2011-12-02T09:03:32] [1717364] Handling REQUEST_STEP_TERMINATE
   [2011-12-02T09:03:32] [1717364] _handle_terminate for job 1717364.4294967294
   [2011-12-02T09:03:32] [1717364]   uid = 0
   [2011-12-02T09:03:32] [1717364] Sent SIGKILL signal to 1717364.4294967294

   [2011-12-02T09:03:32] [1717364] Leaving  _handle_request: SLURM_SUCCESS
   [2011-12-02T09:03:32] [1717364] Entering _handle_request
   [2011-12-02T09:03:32] [1717364] Leaving  _handle_accept
   [2011-12-02T09:03:32] debug4: found jobid = 1717364, stepid = 4294967294
   [2011-12-02T09:03:32] [1717364] Called _msg_socket_accept
   [2011-12-02T09:03:32] [1717364] Leaving _msg_socket_accept
   [2011-12-02T09:03:32] [1717364] eio: handling events for 1 objects
   [2011-12-02T09:03:32] [1717364] Called _msg_socket_readable
   [2011-12-02T09:03:32] [1717364] Entering _handle_accept (new thread)
   [2011-12-02T09:03:32] [1717364]   Identity: uid=0, gid=0
   [2011-12-02T09:03:32] [1717364] Entering _handle_request
   [2011-12-02T09:03:32] [1717364] Got request 5
   [2011-12-02T09:03:32] [1717364] Handling REQUEST_STATE
   [2011-12-02T09:03:32] [1717364] Leaving  _handle_request: SLURM_SUCCESS
   [2011-12-02T09:03:32] [1717364] Entering _handle_request
   [2011-12-02T09:03:32] [1717364] Leaving  _handle_accept
   [2011-12-02T09:03:32] [1717364] task 0 (16426) exited. Killed by signal 9.
   [2011-12-02T09:03:32] [1717364] task_post_term: 1717364.4294967294, task 0
   [2011-12-02T09:03:32] [1717364] Aggregated 1 task exit messages
   [2011-12-02T09:03:32] [1717364] sending task exit msg for 1 tasks
   [2011-12-02T09:03:32] [1717364] Before call to spank_fini()
   [2011-12-02T09:03:32] [1717364] After call to spank_fini()
   [2011-12-02T09:03:32] [1717364] eio: handling events for 1 objects
   [2011-12-02T09:03:32] [1717364] Called _msg_socket_readable
   [2011-12-02T09:03:32] [1717364]   false, shutdown
   [2011-12-02T09:03:32] [1717364] Message thread exited
   [2011-12-02T09:03:32] [1717364] job 1717364 completed with slurm_rc = 0, 
job_rc = 9
   [2011-12-02T09:03:32] [1717364] sending REQUEST_COMPLETE_BATCH_SCRIPT
   [2011-12-02T09:03:32] [1717364] done with job
   
The one obvious difference between these systems is that on the 2.2.7
system we have SlurmUser running as an unprivileged user (uid == 103),
and on the 2.2.4 system it's running as root. But since SIGTERM
obviously gets received by the job script this shouldn't be the cause
of error.

/Per Lundqvist

-- 
Per Lundqvist

National Supercomputer Centre
Linköping University, Sweden

http://www.nsc.liu.se

Reply via email to