Michael, this was fixed in 2.6.4, commit d7dfa58ef.

Danny

On 05/08/2014 03:25 PM, Michael Gutteridge wrote:
Hi all..

I've run into a curious situation on our production cluster (Slurm 2.6.2, MWM 7.1.2). It appears that sometimes, when a node crashes, the job on that node is requeued contrary to the configuration (JobRequeue=0). The requeue flag doesn't appear to be set on the job, either. Nor do I think Moab is at fault: it notes that the job completed, but doesn't see the requeue. This is curious to me because we are using the wiki2 scheduler, so it appears that the requeue decision was made internally by Slurm (i.e. without consulting MWM).

So... logs.

slurmctld.log:

[2014-05-07T23:24:35.143] sched: Allocate JobId=6568293 NodeList=gizmof84 #CPUs=4
...
[2014-05-08T02:53:47.852] error: Nodes gizmof84 not responding
...
[2014-05-08T02:58:01.150] Batch JobId=6568293 missing from node 0
[2014-05-08T02:58:01.150] completing job 6568293
[2014-05-08T02:58:01.150] Job 6568293 cancelled from interactive user
[2014-05-08T02:58:01.150] Requeue JobId=6568293 due to node failure
[2014-05-08T02:58:01.151] sched: job_complete for JobId=6568293 successful, exit code=4294967294
...
[2014-05-08T02:58:01.863] requeue batch job 6568293
...
[2014-05-08T02:58:11.591] _slurm_rpc_submit_batch_job JobId=6577619 usec=2788
[2014-05-08T02:58:12.024] completing job 6568293
[2014-05-08T02:58:12.025] sched: job_complete for JobId=6568293 successful, exit code=0

That exit code looks really horrible. I'm also certain this job wasn't cancelled by the user (assuming that is what the "cancelled from interactive user" message means). Is it possible that the job record was somehow corrupted as the node failed?
Anyway, in slurmd.log there are indications that munge took some time getting sorted out (communication errors), but once it does:

[2014-05-08T02:58:01.122] Purging vestigal job script /var/tmp/slurmd/job6568293/slurm_script
[2014-05-08T02:58:11.061] reissued job credential for job 6568293
[2014-05-08T02:58:11.135] Launching batch job 6568293 for UID 45402
[2014-05-08T02:58:11.151] Received cpu frequency information for 4 cpus
[2014-05-08T02:58:11.164] [6568293] checkpoint/blcr init
[2014-05-08T02:58:12.000] [6568293] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2014-05-08T02:58:12.007] [6568293] done with job

I don't have a solid reproducible example yet, but I thought I'd see if this is a known issue or if anyone has any thoughts on how to get this to properly fail the jobs.

Thanks much

Michael
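
For anyone wanting to check whether their own cluster has hit the same behaviour, a minimal sketch along these lines may help. It scans slurmctld.log lines for the "Requeue JobId=... due to node failure" message quoted above and reports the affected job IDs. The function name and the sample data are illustrative assumptions, and the pattern is based only on the log format shown in this thread, so it may need adjusting for other Slurm versions.

```python
import re

# Pattern modelled on the slurmctld.log line quoted in this thread, e.g.
# [2014-05-08T02:58:01.150] Requeue JobId=6568293 due to node failure
# (assumption: other Slurm versions may word this message differently).
REQUEUE_RE = re.compile(
    r"^\[(?P<ts>[^\]]+)\]\s+Requeue JobId=(?P<job>\d+) due to node failure"
)


def find_node_failure_requeues(lines):
    """Return (timestamp, job_id) pairs for each node-failure requeue line."""
    hits = []
    for line in lines:
        match = REQUEUE_RE.match(line.strip())
        if match:
            hits.append((match.group("ts"), int(match.group("job"))))
    return hits


# Sample lines copied from the log excerpt in this thread.
sample = [
    "[2014-05-08T02:58:01.150] completing job 6568293",
    "[2014-05-08T02:58:01.150] Requeue JobId=6568293 due to node failure",
    "[2014-05-08T02:58:01.863] requeue batch job 6568293",
]

for timestamp, job_id in find_node_failure_requeues(sample):
    print(f"{timestamp} job {job_id} requeued after node failure")
```

To run it against a real log, pass an open file object instead of the sample list; the log path depends on the site's SlurmctldLogFile setting.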
