We are running slurm 2.5.6. Last night, about 10 jobs were scheduled and completed normally, but they are still listed as RUNNING in sacct. For instance:
# date
Tue Oct  1 13:31:30 CEST 2013
# sacct --duplicates -o jobid,jobname,alloccpus,start,state,end,exitcode,elapsed -j 3371606
       JobID    JobName  AllocCPUS               Start      State                 End ExitCode    Elapsed
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ----------
3371606           parse         11 2013-09-30T03:16:48    RUNNING 2013-09-30T04:11:10      0:0   00:54:22
3371606.bat+      batch          1 2013-09-30T03:16:48  COMPLETED 2013-09-30T04:11:10      0:0   00:54:22
# scontrol show job 3371606
slurm_load_jobs error: Invalid job id specified

There are no processes left on the node, and the slurmctld epilog ran fine. I've included the (hopefully) relevant lines from slurmctld.log, slurmdbd.log and slurmd.log below.

One thing worth noting: all of these jobs were started at the same moment, just as slurmdbd was restarting due to log rotation.

My questions are:

1) What happened here?
2) How can we fix it (for these jobs)?
3) How can we avoid it in the future? (Our logrotate script restarts slurmdbd and then slurm. Perhaps it would be safer to stop slurm, restart slurmdbd, and then start slurm?)
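In other words, something like the following in the logrotate postrotate step (a sketch only; the actual service script names on our systems may differ):

```shell
#!/bin/sh
# Sketch of the proposed ordering: stop slurmctld first so no accounting
# RPCs are sent while slurmdbd is down, restart slurmdbd, then bring
# slurmctld back up.  Service names here are illustrative.
service slurm stop        # stop slurmctld before touching the dbd
service slurmdbd restart  # rotate/restart the database daemon
service slurm start       # slurmctld reconnects to a live slurmdbd
```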
slurmctld.log:

[2013-09-30T02:58:27+02:00] _slurm_rpc_submit_batch_job JobId=3371606 usec=8080
[2013-09-30T03:16:48+02:00] backfill test for job 3371606
[2013-09-30T03:16:48+02:00] debug: backfill: user 4010: #jobs 3
[2013-09-30T03:16:48+02:00] backfill: Started JobId=3371606 on c8-20
[2013-09-30T03:16:49+02:00] Job 3371606 launch delayed by 1 secs, updating end_time
[2013-09-30T03:16:55+02:00] Terminate signal (SIGINT or SIGTERM) received
[2013-09-30T03:16:48+02:00] backfill: Started JobId=3371606 on c8-20
[2013-09-30T03:16:56+02:00] slurmctld version 2.5.6 started on cluster abel
[2013-09-30T03:16:57+02:00] Recovered job 3371606 381
[2013-09-30T03:17:12+02:00] debug: assoc is afa458 (381) for job 3371606
[2013-09-30T03:17:12+02:00] debug: now assoc is 7f3048118118 (381) for job 3371606
[2013-09-30T04:11:10+02:00] completing job 3371606
[2013-09-30T04:11:10+02:00] sched: job_complete for JobId=3371606 successful

slurmdbd.log:

[2013-09-30T02:58:32+02:00] debug2: DBD_JOB_START: ELIGIBLE CALL ID:3371606 NAME:parse
[2013-09-30T02:58:32+02:00] debug2: as_mysql_slurmdb_job_start() called
[2013-09-30T02:58:32+02:00] debug3: 7(as_mysql_job.c:528) query insert into "abel_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc, nodes_alloc, req_disk_pn, req_mem_per_cpu, account, partition, req_ntasks_pn) values (3371606, 381, 10, 0, 4010, 100, 'None assigned', 0, 720, 1380502707, 1380502707, 0, 'parse', 0, 0, 20374, 1, 0, 0, 0, 4608, 'nn9106k', 'normal', 11) on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=4010, id_group=100, nodelist='None assigned', id_resv=0, timelimit=720, time_submit=1380502707, time_start=0, job_name='parse', track_steps=0, id_qos=10, state=greatest(state, 0), priority=20374, cpus_req=1, cpus_alloc=0, nodes_alloc=0, req_disk_pn=0, req_mem_per_cpu=4608, account='nn9106k', partition='normal', req_ntasks_pn=11
[2013-09-30T03:16:44+02:00] Terminate signal (SIGINT or SIGTERM) received
[2013-09-30T03:16:57+02:00] slurmdbd version 2.5.6 started
[2013-09-30T03:17:07+02:00] debug2: DBD_JOB_START: START CALL ID:3371606 NAME:parse INX:3842639
[2013-09-30T03:17:07+02:00] debug2: as_mysql_slurmdb_job_start() called
[2013-09-30T03:17:07+02:00] debug3: 7(as_mysql_job.c:583) query update "abel_job_table" set nodelist='c8-20', account='nn9106k', partition='normal', node_inx='271', time_start=1380503808, job_name='parse', state=1, cpus_alloc=11, nodes_alloc=1, id_qos=10, id_assoc=381, id_wckey=0, id_resv=0, timelimit=720, time_eligible=1380502707, req_disk_pn=0, req_mem_per_cpu=4608 where job_db_inx=3842639
[2013-09-30T04:11:10+02:00] debug2: DBD_GET_JOBS_COND: called
[2013-09-30T04:11:10+02:00] debug3: 8(as_mysql_jobacct_process.c:411) query select t1.account, t1.cpus_alloc, t1.cpus_req, t1.derived_ec, t1.derived_es, t1.exit_code, t1.id_assoc, t1.id_block, t1.id_group, t1.id_job, t1.id_qos, t1.id_resv, t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.node_inx, t1.nodelist, t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.time_eligible, t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey, t2.acct, t2.lft, t2.user from "abel_job_table" as t1 left join "abel_assoc_table" as t2 on t1.id_assoc=t2.id_assoc where (t1.id_job=3371606) group by id_job, time_submit desc
[2013-09-30T04:11:10+02:00] debug2: DBD_STEP_START: ID:3371606.4294967294 NAME:batch SUBMIT:1380502707
[2013-09-30T04:11:10+02:00] debug3: 7(as_mysql_job.c:971) query insert into "abel_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist) values (3842639, -2, 1380503808, 'batch', 1, 1, 1, 1, 'c8-20', '271', 0) on duplicate key update cpus_alloc=1, nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='c8-20', node_inx='271', task_dist=0
[2013-09-30T04:11:10+02:00] debug2: DBD_STEP_COMPLETE: ID:3371606.4294967294 SUBMIT:1380502707
[2013-09-30T04:11:10+02:00] debug3: 7(as_mysql_job.c:1126) query update "abel_step_table" set time_end=1380507070, state=3, kill_requid=-1, exit_code=0, user_sec=35, user_usec=250641, sys_sec=3, sys_usec=376486, max_vsize=1548116, max_vsize_task=0, max_vsize_node=0, ave_vsize=208412.000000, max_rss=871156, max_rss_task=0, max_rss_node=0, ave_rss=4100.000000, max_pages=16, max_pages_task=0, max_pages_node=0, ave_pages=0.000000, min_cpu=0, min_cpu_task=0, min_cpu_node=0, ave_cpu=0.000000, act_cpufreq=2601000, consumed_energy=0 where job_db_inx=3842639 and id_step=-2
[2013-09-30T04:11:10+02:00] debug2: DBD_JOB_COMPLETE: ID:3371606
[2013-09-30T04:11:10+02:00] debug2: as_mysql_slurmdb_job_complete() called
[2013-09-30T04:11:10+02:00] debug3: 7(as_mysql_job.c:824) query update "abel_job_table" set time_end=1380507070, state=3, nodelist='c8-20', derived_ec=0, exit_code=0, kill_requid=-1 where job_db_inx=3842639;

slurmd.log:

[2013-09-30T03:16:49+02:00] Running spank/prolog for jobid [3371606] uid [4010]
[2013-09-30T03:16:49+02:00] spank: opening plugin stack /etc/slurm/plugstack.conf
[2013-09-30T03:16:49+02:00] debug: [job 3371606] attempting to run prolog [/hpc/sbin/prolog_slurmd]
[2013-09-30T03:16:49+02:00] Launching batch job 3371606 for UID 4010
[2013-09-30T03:16:49+02:00] [3371606] laying out the 11 tasks on 1 hosts c8-20 dist 2
[2013-09-30T04:11:10+02:00] debug2: got this type of message 5019
[2013-09-30T04:11:10+02:00] [3371606] Handling REQUEST_INFO
[2013-09-30T04:11:10+02:00] debug: Entering stepd_stat_jobacct for job 3371606.4294967294
[2013-09-30T04:11:10+02:00] [3371606] Handling REQUEST_STEP_STAT
[2013-09-30T04:11:10+02:00] [3371606] _handle_stat_jobacct for job 3371606.4294967294
[2013-09-30T04:11:10+02:00] [3371606] 3538 mem size 871156 1548116 time 0(0+0)
[2013-09-30T04:11:10+02:00] [3371606] _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
[2013-09-30T04:11:10+02:00] [3371606] cpu 0 freq= 2601000
[2013-09-30T04:11:10+02:00] [3371606] Task average frequency = 2601000
[2013-09-30T04:11:10+02:00] [3371606] pid 3538 mem size 871156 1548116 time 0(0+0)
[2013-09-30T04:11:10+02:00] [3371606] energycounted= 0
[2013-09-30T04:11:10+02:00] [3371606] getjoules_task energy = 0
[2013-09-30T04:11:10+02:00] [3371606] Job 3371606 memory used:4100 limit:51904512 KB
[2013-09-30T04:11:10+02:00] [3371606] Handling REQUEST_STEP_LIST_PIDS
[2013-09-30T04:11:10+02:00] [3371606] _handle_list_pids for job 3371606.4294967294
[2013-09-30T04:11:10+02:00] [3371606] getjoules_task energy = 0
[2013-09-30T04:11:10+02:00] [3371606] removing task 0 pid 3538 from jobacct
[2013-09-30T04:11:10+02:00] [3371606] task 0 (3538) exited with exit code 0.
[2013-09-30T04:11:10+02:00] [3371606] Aggregated 1 task exit messages
[2013-09-30T04:11:10+02:00] [3371606] cpu_freq_reset: #cpus reset = 0
[2013-09-30T04:11:10+02:00] [3371606] Before call to spank_fini()
[2013-09-30T04:11:10+02:00] [3371606] After call to spank_fini()
[2013-09-30T04:11:10+02:00] [3371606] false, shutdown
[2013-09-30T04:11:10+02:00] [3371606] Message thread exited
[2013-09-30T04:11:10+02:00] [3371606] job 3371606 completed with slurm_rc = 0, job_rc = 0
[2013-09-30T04:11:10+02:00] [3371606] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-09-30T04:11:10+02:00] [3371606] done with job
[2013-09-30T04:11:10+02:00] debug2: got this type of message 6011
[2013-09-30T04:11:10+02:00] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2013-09-30T04:11:10+02:00] debug: _rpc_terminate_job, uid = 501
[2013-09-30T04:11:10+02:00] debug: credential for job 3371606 revoked
[2013-09-30T04:11:10+02:00] debug2: No steps in jobid 3371606 to send signal 18
[2013-09-30T04:11:10+02:00] debug2: No steps in jobid 3371606 to send signal 15
[2013-09-30T04:11:10+02:00] debug2: set revoke expiration for jobid 3371606 to 1380508270 UTS
[2013-09-30T04:11:10+02:00] debug: Waiting for job 3371606's prolog to complete
[2013-09-30T04:11:10+02:00] debug: Finished wait for job 3371606's prolog to complete
[2013-09-30T04:11:10+02:00] debug: Calling /opt/slurm/sbin/slurmstepd spank epilog
[2013-09-30T04:11:10+02:00] Reading slurm.conf file: /etc/slurm/slurm.conf
[2013-09-30T04:11:10+02:00] Running spank/epilog for jobid [3371606] uid [4010]
[2013-09-30T04:11:10+02:00] spank: opening plugin stack /etc/slurm/plugstack.conf
[2013-09-30T04:11:10+02:00] debug: [job 3371606] attempting to run epilog [/hpc/sbin/epilog_slurmd]
[2013-09-30T04:11:10+02:00] debug: completed epilog for jobid 3371606
[2013-09-30T04:11:10+02:00] debug: Job 3371606: sent epilog complete msg: rc = 0

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo