We are running slurm 2.5.6. Last night, about 10 jobs were scheduled and completed normally, but they are still listed as RUNNING in sacct. For instance:
# date
Tue Oct  1 13:31:30 CEST 2013
# sacct --duplicates -o jobid,jobname,alloccpus,start,state,end,exitcode,elapsed -j 3371606
       JobID    JobName  AllocCPUS               Start      State                 End ExitCode    Elapsed
------------ ---------- ---------- ------------------- ---------- ------------------- -------- ----------
3371606           parse         11 2013-09-30T03:16:48    RUNNING 2013-09-30T04:11:10      0:0   00:54:22
3371606.bat+      batch          1 2013-09-30T03:16:48  COMPLETED 2013-09-30T04:11:10      0:0   00:54:22
# scontrol show job 3371606
slurm_load_jobs error: Invalid job id specified

There are no processes left on the node, and the slurmctld epilog ran fine. I've included the (hopefully) relevant lines from slurmctld.log, slurmdbd.log and slurmd.log below.

One thing worth noting: all of these jobs were started at the same moment, just as slurmdbd was restarting due to log rotation.

My questions are:

1) What happened here?
2) How can we fix it (for these jobs)?
3) How can we avoid it in the future? (Our logrotate script restarts slurmdbd and then slurm. Perhaps it would be safer to stop slurm, restart slurmdbd, and then start slurm?)
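In other words, something like the following in the logrotate postrotate step (a sketch only; the actual service script names on our systems may differ):

```shell
#!/bin/sh
# Sketch of the proposed ordering: stop slurmctld first so no accounting
# RPCs are sent while slurmdbd is down, restart slurmdbd, then bring
# slurmctld back up.  Service names here are illustrative.
service slurm stop        # stop slurmctld before touching the dbd
service slurmdbd restart  # rotate/restart the database daemon
service slurm start       # slurmctld reconnects to a live slurmdbd
```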
slurmctld.log:

[2013-09-30T02:58:27+02:00] _slurm_rpc_submit_batch_job JobId=3371606 usec=8080
[2013-09-30T03:16:48+02:00] backfill test for job 3371606
[2013-09-30T03:16:48+02:00] debug: backfill: user 4010: #jobs 3
[2013-09-30T03:16:48+02:00] backfill: Started JobId=3371606 on c8-20
[2013-09-30T03:16:49+02:00] Job 3371606 launch delayed by 1 secs, updating end_time
[2013-09-30T03:16:55+02:00] Terminate signal (SIGINT or SIGTERM) received
[2013-09-30T03:16:48+02:00] backfill: Started JobId=3371606 on c8-20
[2013-09-30T03:16:56+02:00] slurmctld version 2.5.6 started on cluster abel
[2013-09-30T03:16:57+02:00] Recovered job 3371606 381
[2013-09-30T03:17:12+02:00] debug: assoc is afa458 (381) for job 3371606
[2013-09-30T03:17:12+02:00] debug: now assoc is 7f3048118118 (381) for job 3371606
[2013-09-30T04:11:10+02:00] completing job 3371606
[2013-09-30T04:11:10+02:00] sched: job_complete for JobId=3371606 successful

slurmdbd.log:

[2013-09-30T02:58:32+02:00] debug2: DBD_JOB_START: ELIGIBLE CALL ID:3371606 NAME:parse
[2013-09-30T02:58:32+02:00] debug2: as_mysql_slurmdb_job_start() called
[2013-09-30T02:58:32+02:00] debug3: 7(as_mysql_job.c:528) query insert into "abel_job_table" (id_job, id_assoc, id_qos, id_wckey, id_user, id_group, nodelist, id_resv, timelimit, time_eligible, time_submit, time_start, job_name, track_steps, state, priority, cpus_req, cpus_alloc, nodes_alloc, req_disk_pn, req_mem_per_cpu, account, partition, req_ntasks_pn) values (3371606, 381, 10, 0, 4010, 100, 'None assigned', 0, 720, 1380502707, 1380502707, 0, 'parse', 0, 0, 20374, 1, 0, 0, 0, 4608, 'nn9106k', 'normal', 11) on duplicate key update job_db_inx=LAST_INSERT_ID(job_db_inx), id_wckey=0, id_user=4010, id_group=100, nodelist='None assigned', id_resv=0, timelimit=720, time_submit=1380502707, time_start=0, job_name='parse', track_steps=0, id_qos=10, state=greatest(state, 0), priority=20374, cpus_req=1, cpus_alloc=0, nodes_alloc=0, req_disk_pn=0, req_mem_per_cpu=4608, account='nn9106k', partition='normal', req_ntasks_pn=11
[2013-09-30T03:16:44+02:00] Terminate signal (SIGINT or SIGTERM) received
[2013-09-30T03:16:57+02:00] slurmdbd version 2.5.6 started
[2013-09-30T03:17:07+02:00] debug2: DBD_JOB_START: START CALL ID:3371606 NAME:parse INX:3842639
[2013-09-30T03:17:07+02:00] debug2: as_mysql_slurmdb_job_start() called
[2013-09-30T03:17:07+02:00] debug3: 7(as_mysql_job.c:583) query update "abel_job_table" set nodelist='c8-20', account='nn9106k', partition='normal', node_inx='271', time_start=1380503808, job_name='parse', state=1, cpus_alloc=11, nodes_alloc=1, id_qos=10, id_assoc=381, id_wckey=0, id_resv=0, timelimit=720, time_eligible=1380502707, req_disk_pn=0, req_mem_per_cpu=4608 where job_db_inx=3842639
[2013-09-30T04:11:10+02:00] debug2: DBD_GET_JOBS_COND: called
[2013-09-30T04:11:10+02:00] debug3: 8(as_mysql_jobacct_process.c:411) query select t1.account, t1.cpus_alloc, t1.cpus_req, t1.derived_ec, t1.derived_es, t1.exit_code, t1.id_assoc, t1.id_block, t1.id_group, t1.id_job, t1.id_qos, t1.id_resv, t1.id_user, t1.id_wckey, t1.job_db_inx, t1.job_name, t1.kill_requid, t1.node_inx, t1.nodelist, t1.nodes_alloc, t1.partition, t1.priority, t1.state, t1.time_eligible, t1.time_end, t1.time_start, t1.time_submit, t1.time_suspended, t1.timelimit, t1.track_steps, t1.wckey, t2.acct, t2.lft, t2.user from "abel_job_table" as t1 left join "abel_assoc_table" as t2 on t1.id_assoc=t2.id_assoc where (t1.id_job=3371606) group by id_job, time_submit desc
[2013-09-30T04:11:10+02:00] debug2: DBD_STEP_START: ID:3371606.4294967294 NAME:batch SUBMIT:1380502707
[2013-09-30T04:11:10+02:00] debug3: 7(as_mysql_job.c:971) query insert into "abel_step_table" (job_db_inx, id_step, time_start, step_name, state, cpus_alloc, nodes_alloc, task_cnt, nodelist, node_inx, task_dist) values (3842639, -2, 1380503808, 'batch', 1, 1, 1, 1, 'c8-20', '271', 0) on duplicate key update cpus_alloc=1, nodes_alloc=1, task_cnt=1, time_end=0, state=1, nodelist='c8-20', node_inx='271', task_dist=0
[2013-09-30T04:11:10+02:00] debug2: DBD_STEP_COMPLETE: ID:3371606.4294967294 SUBMIT:1380502707
[2013-09-30T04:11:10+02:00] debug3: 7(as_mysql_job.c:1126) query update "abel_step_table" set time_end=1380507070, state=3, kill_requid=-1, exit_code=0, user_sec=35, user_usec=250641, sys_sec=3, sys_usec=376486, max_vsize=1548116, max_vsize_task=0, max_vsize_node=0, ave_vsize=208412.000000, max_rss=871156, max_rss_task=0, max_rss_node=0, ave_rss=4100.000000, max_pages=16, max_pages_task=0, max_pages_node=0, ave_pages=0.000000, min_cpu=0, min_cpu_task=0, min_cpu_node=0, ave_cpu=0.000000, act_cpufreq=2601000, consumed_energy=0 where job_db_inx=3842639 and id_step=-2
[2013-09-30T04:11:10+02:00] debug2: DBD_JOB_COMPLETE: ID:3371606
[2013-09-30T04:11:10+02:00] debug2: as_mysql_slurmdb_job_complete() called
[2013-09-30T04:11:10+02:00] debug3: 7(as_mysql_job.c:824) query update "abel_job_table" set time_end=1380507070, state=3, nodelist='c8-20', derived_ec=0, exit_code=0, kill_requid=-1 where job_db_inx=3842639;

slurmd.log:

[2013-09-30T03:16:49+02:00] Running spank/prolog for jobid [3371606] uid [4010]
[2013-09-30T03:16:49+02:00] spank: opening plugin stack /etc/slurm/plugstack.conf
[2013-09-30T03:16:49+02:00] debug: [job 3371606] attempting to run prolog [/hpc/sbin/prolog_slurmd]
[2013-09-30T03:16:49+02:00] Launching batch job 3371606 for UID 4010
[2013-09-30T03:16:49+02:00] [3371606] laying out the 11 tasks on 1 hosts c8-20 dist 2
[2013-09-30T04:11:10+02:00] debug2: got this type of message 5019
[2013-09-30T04:11:10+02:00] [3371606] Handling REQUEST_INFO
[2013-09-30T04:11:10+02:00] debug: Entering stepd_stat_jobacct for job 3371606.4294967294
[2013-09-30T04:11:10+02:00] [3371606] Handling REQUEST_STEP_STAT
[2013-09-30T04:11:10+02:00] [3371606] _handle_stat_jobacct for job 3371606.4294967294
[2013-09-30T04:11:10+02:00] [3371606] 3538 mem size 871156 1548116 time 0(0+0)
[2013-09-30T04:11:10+02:00] [3371606] _get_sys_interface_freq_line: filename = /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_cur_freq
[2013-09-30T04:11:10+02:00] [3371606] cpu 0 freq= 2601000
[2013-09-30T04:11:10+02:00] [3371606] Task average frequency = 2601000
[2013-09-30T04:11:10+02:00] [3371606] pid 3538 mem size 871156 1548116 time 0(0+0)
[2013-09-30T04:11:10+02:00] [3371606] energycounted= 0
[2013-09-30T04:11:10+02:00] [3371606] getjoules_task energy = 0
[2013-09-30T04:11:10+02:00] [3371606] Job 3371606 memory used:4100 limit:51904512 KB
[2013-09-30T04:11:10+02:00] [3371606] Handling REQUEST_STEP_LIST_PIDS
[2013-09-30T04:11:10+02:00] [3371606] _handle_list_pids for job 3371606.4294967294
[2013-09-30T04:11:10+02:00] [3371606] getjoules_task energy = 0
[2013-09-30T04:11:10+02:00] [3371606] removing task 0 pid 3538 from jobacct
[2013-09-30T04:11:10+02:00] [3371606] task 0 (3538) exited with exit code 0.
[2013-09-30T04:11:10+02:00] [3371606] Aggregated 1 task exit messages
[2013-09-30T04:11:10+02:00] [3371606] cpu_freq_reset: #cpus reset = 0
[2013-09-30T04:11:10+02:00] [3371606] Before call to spank_fini()
[2013-09-30T04:11:10+02:00] [3371606] After call to spank_fini()
[2013-09-30T04:11:10+02:00] [3371606] false, shutdown
[2013-09-30T04:11:10+02:00] [3371606] Message thread exited
[2013-09-30T04:11:10+02:00] [3371606] job 3371606 completed with slurm_rc = 0, job_rc = 0
[2013-09-30T04:11:10+02:00] [3371606] sending REQUEST_COMPLETE_BATCH_SCRIPT, error:0
[2013-09-30T04:11:10+02:00] [3371606] done with job
[2013-09-30T04:11:10+02:00] debug2: got this type of message 6011
[2013-09-30T04:11:10+02:00] debug2: Processing RPC: REQUEST_TERMINATE_JOB
[2013-09-30T04:11:10+02:00] debug: _rpc_terminate_job, uid = 501
[2013-09-30T04:11:10+02:00] debug: credential for job 3371606 revoked
[2013-09-30T04:11:10+02:00] debug2: No steps in jobid 3371606 to send signal 18
[2013-09-30T04:11:10+02:00] debug2: No steps in jobid 3371606 to send signal 15
[2013-09-30T04:11:10+02:00] debug2: set revoke expiration for jobid 3371606 to 1380508270 UTS
[2013-09-30T04:11:10+02:00] debug: Waiting for job 3371606's prolog to complete
[2013-09-30T04:11:10+02:00] debug: Finished wait for job 3371606's prolog to complete
[2013-09-30T04:11:10+02:00] debug: Calling /opt/slurm/sbin/slurmstepd spank epilog
[2013-09-30T04:11:10+02:00] Reading slurm.conf file: /etc/slurm/slurm.conf
[2013-09-30T04:11:10+02:00] Running spank/epilog for jobid [3371606] uid [4010]
[2013-09-30T04:11:10+02:00] spank: opening plugin stack /etc/slurm/plugstack.conf
[2013-09-30T04:11:10+02:00] debug: [job 3371606] attempting to run epilog [/hpc/sbin/epilog_slurmd]
[2013-09-30T04:11:10+02:00] debug: completed epilog for jobid 3371606
[2013-09-30T04:11:10+02:00] debug: Job 3371606: sent epilog complete msg: rc = 0

--
Regards,
Bjørn-Helge Mevik, dr. scient,
Department for Research Computing, University of Oslo