Never mind the "not automatic" part... it just ran without user intervention.

Does anybody see a clue why it didn't run between 06:59:59 and now?

Jobs were definitely completing between 06:59:59 and 09:00:00.

[2017-06-09T09:55:19.121] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:513563 IP:10.21.2.3 CONN:8
[2017-06-09T09:58:19.236] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:513563 IP:10.21.2.3 CONN:9
[2017-06-09T09:58:19.327] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:513563 IP:10.21.2.3 CONN:8
[2017-06-09T10:00:01.094] 0(as_mysql_rollup.c:622) cluster curr hour is now 
1497024000-1497027600
[2017-06-09T10:00:01.094] 0(as_mysql_rollup.c:366) query
select node_name, cpu_count, time_start, time_end, state from 
"cluster_event_table" where !(state & 32768) && (time_start < 1497027600 && 
(time_end >= 1497024000 || time_end = 0)) order by node_name, time_start
[2017-06-09T10:00:01.176] 0(as_mysql_rollup.c:646) query
select id_resv, assoclist, cpus, flags, time_start, time_end from 
"cluster_resv_table" where (time_start < 1497027600 && time_end >= 1497024000) 
&& !(flags & 64)order by time_start
[2017-06-09T10:00:01.179] 0(as_mysql_rollup.c:751) query
select job.job_db_inx, job.id_job, job.id_assoc, job.id_wckey, 
job.array_task_pending, job.time_eligible, job.time_start, job.time_end, 
job.time_suspended, job.cpus_alloc, job.cpus_req, job.id_resv, 
SUM(step.consumed_energy) from "cluster_job_table" as job left outer join 
"cluster_step_table" as step on job.job_db_inx=step.job_db_inx and 
(step.id_step>=0) where (job.time_eligible < 1497027600 && (job.time_end >= 
1497024000 || job.time_end = 0)) group by job.job_db_inx order by job.id_assoc, 
job.time_eligible
[2017-06-09T10:00:01.773] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
UID:507569 IP:10.21.1.5 CONN:9
[2017-06-09T10:00:02.388] debug:  Need to reroll usage from Fri Jun 09 09:59:59 
2017 Job 9641756 from cluster became eligible then and we are just now hearing 
about it.
[2017-06-09T10:00:05.743] 0(as_mysql_rollup.c:305) query
insert into "cluster_usage_hour_table" (creation_time, mod_time, time_start, 
cpu_count, alloc_cpu_secs, down_cpu_secs, pdown_cpu_secs, idle_cpu_secs, 
over_cpu_secs, resv_cpu_secs, consumed_energy) values (...) on duplicate key 
update mod_time=1497027601, cpu_count=VALUES(cpu_count), 
alloc_cpu_secs=VALUES(alloc_cpu_secs), down_cpu_secs=VALUES(down_cpu_secs), 
pdown_cpu_secs=VALUES(pdown_cpu_secs), idle_cpu_secs=VALUES(idle_cpu_secs), 
over_cpu_secs=VALUES(over_cpu_secs), resv_cpu_secs=VALUES(resv_cpu_secs), 
consumed_energy=VALUES(consumed_energy)
[2017-06-09T10:00:05.743] 0(as_mysql_rollup.c:1113) query
insert into "cluster_assoc_usage_hour_table" (creation_time, mod_time, 
id_assoc, time_start, alloc_cpu_secs, consumed_energy) values (...) on 
duplicate key update mod_time=1497027601, 
alloc_cpu_secs=VALUES(alloc_cpu_secs), consumed_energy=VALUES(consumed_energy);
[2017-06-09T10:00:05.745] 0(as_mysql_usage.c:376) query
update "cluster_last_ran_table" set hourly_rollup=1497027600
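For anyone following the timestamps: the epoch pair in the "curr hour" line decodes to the 09:00-10:00 (Pacific) window, i.e. exactly the hour this rollup covered. A quick sanity check (assumes GNU date; BSD date would use -r instead of -d):

```shell
# Decode the window bounds from the log line
#   "curr hour is now 1497024000-1497027600"
date -u -d @1497024000   # -> 2017-06-09 16:00:00 UTC == 09:00 US/Pacific
date -u -d @1497027600   # -> 2017-06-09 17:00:00 UTC == 10:00 US/Pacific
echo $(( 1497027600 - 1497024000 ))   # -> 3600, exactly one hour
```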



[tcooper@cluster ~]# show_last_ran_table.sh
+---------------------+---------------------+---------------------+
| hourly_rollup       | daily_rollup        | monthly_rollup      |
+---------------------+---------------------+---------------------+
| 2017-06-09 10:00:00 | 2017-06-09 09:59:59 | 2017-06-09 09:59:59 |
+---------------------+---------------------+---------------------+
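In case it helps anyone reproduce the check: the show_last_ran_table.sh script itself wasn't posted, but judging from the output it is roughly equivalent to a query like this against the accounting database (the table name comes from the log above; everything else here is an assumption):

```sql
-- Hypothetical reconstruction of show_last_ran_table.sh's query; the
-- actual script was not posted. The epoch columns are rendered as
-- readable timestamps, matching the table output above.
SELECT FROM_UNIXTIME(hourly_rollup)  AS hourly_rollup,
       FROM_UNIXTIME(daily_rollup)   AS daily_rollup,
       FROM_UNIXTIME(monthly_rollup) AS monthly_rollup
  FROM "cluster_last_ran_table";
```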

Thanks,

Trevor

> On Jun 9, 2017, at 9:42 AM, Cooper, Trevor <[email protected]> wrote:
> 
> 
> After discovering issues with sreport and tracing them back to Slurm DB orphans 
> (running AND pending), and after working to resolve that problem by clearing the 
> DB of running and pending orphans, I find that rollups are still not running 
> automatically.
> 
> For example, I triggered a re-rollup this morning by modifying the entries in 
> the last_ran_table and restarting slurmdbd; the rollups were redone from April 
> 1, 2016 through today at 06:00 AM.
> 
> 
> [tcooper@cluster ~]# show_last_ran_table.sh
> +---------------------+---------------------+---------------------+
> | hourly_rollup       | daily_rollup        | monthly_rollup      |
> +---------------------+---------------------+---------------------+
> | 2017-06-09 06:00:00 | 2017-06-09 00:00:00 | 2017-06-01 00:00:00 |
> +---------------------+---------------------+---------------------+
> 
> 
> Three hours later, the hourly rollups for [07-09]:00 AM still had not run. A 
> restart of slurmdbd triggers the rollups, which then run successfully...
> 
> 
> [tcooper@cluster ~]# service slurmdbd restart
> stopping slurmdbd:                                         [  OK  ]
> slurmdbd (pid 30526) is running...
> slurmdbd (pid 30526) is running...
> slurmdbd (pid 30526) is running...
> slurmdbd (pid 30526) is running...
> starting slurmdbd:                                         [  OK  ]
> 
> 
> [tcooper@cluster ~]# tailf /var/log/slurm/slurmdbd.log | egrep -v "post user"
> ...
> [2017-06-09T09:22:16.553] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
> UID:513563 IP:10.21.2.3 CONN:9
> [2017-06-09T09:22:16.644] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
> UID:513563 IP:10.21.2.3 CONN:8
> [2017-06-09T09:24:07.700] Terminate signal (SIGINT or SIGTERM) received
> [2017-06-09T09:24:18.077] debug:  auth plugin for Munge 
> (http://code.google.com/p/munge/) loaded
> [2017-06-09T09:24:19.190] Accounting storage MYSQL plugin loaded
> [2017-06-09T09:24:28.114] slurmdbd version 14.11.11 started
> [2017-06-09T09:24:28.115] 0(as_mysql_rollup.c:622) cluster curr hour is now 
> 1497013200-1497016800
> ...
> [2017-06-09T09:24:41.831] Warning: Note very large processing time from 
> hourly_rollup for cluster: usec=13715990 began=09:24:28.115
> [2017-06-09T09:24:41.831] 0(as_mysql_usage.c:376) query
> update "cluster_last_ran_table" set hourly_rollup=1497024000
> [2017-06-09T09:25:16.746] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
> UID:513563 IP:10.21.2.3 CONN:9
> [2017-06-09T09:25:16.839] debug:  DBD_INIT: CLUSTER:cluster VERSION:7168 
> UID:513563 IP:10.21.2.3 CONN:8
> ...
> 
> 
> [tcooper@cluster ~]# show_last_ran_table.sh
> +---------------------+---------------------+---------------------+
> | hourly_rollup       | daily_rollup        | monthly_rollup      |
> +---------------------+---------------------+---------------------+
> | 2017-06-09 09:00:00 | 2017-06-09 00:00:00 | 2017-06-01 00:00:00 |
> +---------------------+---------------------+---------------------+
> 
> Before our DB orphan issue, generation of rollups was NOT a problem.
> 
> Can anyone provide insight into where the 'trigger' for hourly rollups is 
> supposed to come from and/or know if this is a bug in Slurm 14.11.11 that is 
> fixed in a later version?
> 
> 
> Thanks,
> 
> Trevor Cooper
> HPC Systems Programmer
> San Diego Supercomputer Center, UCSD
> 9500 Gilman Drive, 0505
> La Jolla, CA 92093-0505
