On Fri, 2 Dec 2011, Mark A. Grondona wrote: > On Fri, 2 Dec 2011 11:05:51 +0100 (CET), Per Lundqvist <[email protected]> > wrote: > > Hi, > > > > We have observed that jobs trapping SIGTERM without exiting the job > > script, doesn't get killed with SIGKILL after 5m (KillWait=300) and > > instead gets stuck in COMPLETING. > > That is interesting. There is no way to trap SIGKILL so I wonder if > SLURM is not ever sending the SIGKILL?
Yes, I think so too, here's the verbose slurmd log from around the time when it's supposed to send the kill signal (in this case it is expected to be killed 5 minutes after 09:32:45, == 09:37:45): [2011-12-02T09:36:54] debug4: found jobid = 323254, stepid = 4294967294 [2011-12-02T09:36:54] [323254] Called _msg_socket_accept [2011-12-02T09:36:54] [323254] Leaving _msg_socket_accept [2011-12-02T09:36:54] [323254] eio: handling events for 1 objects [2011-12-02T09:36:54] [323254] Called _msg_socket_readable [2011-12-02T09:36:54] [323254] Entering _handle_accept (new thread) [2011-12-02T09:36:54] [323254] Identity: uid=0, gid=0 [2011-12-02T09:36:54] [323254] Entering _handle_request [2011-12-02T09:36:54] [323254] Got request 5 [2011-12-02T09:36:54] [323254] Handling REQUEST_STATE [2011-12-02T09:36:54] [323254] Leaving _handle_request: SLURM_SUCCESS [2011-12-02T09:36:54] [323254] Entering _handle_request [2011-12-02T09:36:54] [323254] Leaving _handle_accept [2011-12-02T09:37:54] debug4: found jobid = 323254, stepid = 4294967294 [2011-12-02T09:37:54] [323254] Called _msg_socket_accept [2011-12-02T09:37:54] [323254] Leaving _msg_socket_accept [2011-12-02T09:37:54] [323254] eio: handling events for 1 objects [2011-12-02T09:37:54] [323254] Called _msg_socket_readable [2011-12-02T09:37:54] [323254] Entering _handle_accept (new thread) [2011-12-02T09:37:54] [323254] Identity: uid=0, gid=0 [2011-12-02T09:37:54] [323254] Entering _handle_request [2011-12-02T09:37:54] [323254] Got request 5 [2011-12-02T09:37:54] [323254] Handling REQUEST_STATE [2011-12-02T09:37:54] [323254] Leaving _handle_request: SLURM_SUCCESS [2011-12-02T09:37:54] [323254] Entering _handle_request [2011-12-02T09:37:54] [323254] Leaving _handle_accept [2011-12-02T09:38:29] debug3: in the service_connection [2011-12-02T09:38:29] debug2: got this type of message 6011 [2011-12-02T09:38:29] debug2: Processing RPC: REQUEST_TERMINATE_JOB [2011-12-02T09:38:29] debug: _rpc_terminate_job, uid = 103 [2011-12-02T09:38:29] debug: task_slurmd_release_resources: 323254 [2011-12-02T09:38:54] debug4: found jobid = 323254, stepid = 4294967294 [2011-12-02T09:38:54] [323254] Called _msg_socket_accept [2011-12-02T09:38:54] [323254] Leaving _msg_socket_accept [2011-12-02T09:38:54] [323254] eio: handling events for 1 objects [2011-12-02T09:38:54] [323254] Called _msg_socket_readable [2011-12-02T09:38:54] [323254] Entering _handle_accept (new thread) [2011-12-02T09:38:54] [323254] Identity: uid=0, gid=0 [2011-12-02T09:38:54] [323254] Entering _handle_request [2011-12-02T09:38:54] [323254] Got request 5 [2011-12-02T09:38:54] [323254] Handling REQUEST_STATE [2011-12-02T09:38:54] [323254] Leaving _handle_request: SLURM_SUCCESS [2011-12-02T09:38:54] [323254] Entering _handle_request [2011-12-02T09:38:54] [323254] Leaving _handle_accept And here is the corresponding log when running on our 2.2.4 system, which succeeds: [2011-12-02T09:03:32] debug4: found jobid = 1717364, stepid = 4294967294 [2011-12-02T09:03:32] [1717364] Called _msg_socket_accept [2011-12-02T09:03:32] [1717364] Leaving _msg_socket_accept [2011-12-02T09:03:32] [1717364] eio: handling events for 1 objects [2011-12-02T09:03:32] [1717364] Called _msg_socket_readable [2011-12-02T09:03:32] [1717364] Entering _handle_accept (new thread) [2011-12-02T09:03:32] [1717364] Identity: uid=0, gid=0 [2011-12-02T09:03:32] [1717364] Entering _handle_request [2011-12-02T09:03:32] debug2: terminate job step 1717364.4294967294 [2011-12-02T09:03:32] [1717364] Got request 12 [2011-12-02T09:03:32] [1717364] Handling REQUEST_STEP_TERMINATE [2011-12-02T09:03:32] [1717364] _handle_terminate for job 1717364.4294967294 [2011-12-02T09:03:32] [1717364] uid = 0 [2011-12-02T09:03:32] [1717364] Sent SIGKILL signal to 1717364.4294967294 [2011-12-02T09:03:32] [1717364] Leaving _handle_request: SLURM_SUCCESS [2011-12-02T09:03:32] [1717364] Entering _handle_request [2011-12-02T09:03:32] [1717364] Leaving _handle_accept [2011-12-02T09:03:32] debug4: found jobid = 1717364, stepid = 4294967294 [2011-12-02T09:03:32] [1717364] Called _msg_socket_accept [2011-12-02T09:03:32] [1717364] Leaving _msg_socket_accept [2011-12-02T09:03:32] [1717364] eio: handling events for 1 objects [2011-12-02T09:03:32] [1717364] Called _msg_socket_readable [2011-12-02T09:03:32] [1717364] Entering _handle_accept (new thread) [2011-12-02T09:03:32] [1717364] Identity: uid=0, gid=0 [2011-12-02T09:03:32] [1717364] Entering _handle_request [2011-12-02T09:03:32] [1717364] Got request 5 [2011-12-02T09:03:32] [1717364] Handling REQUEST_STATE [2011-12-02T09:03:32] [1717364] Leaving _handle_request: SLURM_SUCCESS [2011-12-02T09:03:32] [1717364] Entering _handle_request [2011-12-02T09:03:32] [1717364] Leaving _handle_accept [2011-12-02T09:03:32] [1717364] task 0 (16426) exited. Killed by signal 9. [2011-12-02T09:03:32] [1717364] task_post_term: 1717364.4294967294, task 0 [2011-12-02T09:03:32] [1717364] Aggregated 1 task exit messages [2011-12-02T09:03:32] [1717364] sending task exit msg for 1 tasks [2011-12-02T09:03:32] [1717364] Before call to spank_fini() [2011-12-02T09:03:32] [1717364] After call to spank_fini() [2011-12-02T09:03:32] [1717364] eio: handling events for 1 objects [2011-12-02T09:03:32] [1717364] Called _msg_socket_readable [2011-12-02T09:03:32] [1717364] false, shutdown [2011-12-02T09:03:32] [1717364] Message thread exited [2011-12-02T09:03:32] [1717364] job 1717364 completed with slurm_rc = 0, job_rc = 9 [2011-12-02T09:03:32] [1717364] sending REQUEST_COMPLETE_BATCH_SCRIPT [2011-12-02T09:03:32] [1717364] done with job The one obvious difference between these systems is that on the 2.2.7 system we have SlurmUser running as an unprivileged user (uid == 103), and on the 2.2.4 system it's running as root. But since SIGTERM obviously gets received by the job script this shouldn't be the cause of error. /Per Lundqvist -- Per Lundqvist National Supercomputer Centre Linköping University, Sweden http://www.nsc.liu.se
