Never mind the "not automatic" part... it just ran without user intervention.
Does anybody see a clue why it didn't run between 06:59:59 and now? Jobs were definitely completing between 06:59:59 and 09:00:00.

[2017-06-09T09:55:19.121] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:513563 IP:10.21.2.3 CONN:8
[2017-06-09T09:58:19.236] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:513563 IP:10.21.2.3 CONN:9
[2017-06-09T09:58:19.327] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:513563 IP:10.21.2.3 CONN:8
[2017-06-09T10:00:01.094] 0(as_mysql_rollup.c:622) cluster curr hour is now 1497024000-1497027600
[2017-06-09T10:00:01.094] 0(as_mysql_rollup.c:366) query select node_name, cpu_count, time_start, time_end, state from "cluster_event_table" where !(state & 32768) && (time_start < 1497027600 && (time_end >= 1497024000 || time_end = 0)) order by node_name, time_start
[2017-06-09T10:00:01.176] 0(as_mysql_rollup.c:646) query select id_resv, assoclist, cpus, flags, time_start, time_end from "cluster_resv_table" where (time_start < 1497027600 && time_end >= 1497024000) && !(flags & 64)order by time_start
[2017-06-09T10:00:01.179] 0(as_mysql_rollup.c:751) query select job.job_db_inx, job.id_job, job.id_assoc, job.id_wckey, job.array_task_pending, job.time_eligible, job.time_start, job.time_end, job.time_suspended, job.cpus_alloc, job.cpus_req, job.id_resv, SUM(step.consumed_energy) from "cluster_job_table" as job left outer join "cluster_step_table" as step on job.job_db_inx=step.job_db_inx and (step.id_step>=0) where (job.time_eligible < 1497027600 && (job.time_end >= 1497024000 || job.time_end = 0)) group by job.job_db_inx order by job.id_assoc, job.time_eligible
[2017-06-09T10:00:01.773] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:507569 IP:10.21.1.5 CONN:9
[2017-06-09T10:00:02.388] debug: Need to reroll usage from Fri Jun 09 09:59:59 2017 Job 9641756 from cluster became eligible then and we are just now hearing about it.
[2017-06-09T10:00:05.743] 0(as_mysql_rollup.c:305) query insert into "cluster_usage_hour_table" (creation_time, mod_time, time_start, cpu_count, alloc_cpu_secs, down_cpu_secs, pdown_cpu_secs, idle_cpu_secs, over_cpu_secs, resv_cpu_secs, consumed_energy) values (...) on duplicate key update mod_time=1497027601, cpu_count=VALUES(cpu_count), alloc_cpu_secs=VALUES(alloc_cpu_secs), down_cpu_secs=VALUES(down_cpu_secs), pdown_cpu_secs=VALUES(pdown_cpu_secs), idle_cpu_secs=VALUES(idle_cpu_secs), over_cpu_secs=VALUES(over_cpu_secs), resv_cpu_secs=VALUES(resv_cpu_secs), consumed_energy=VALUES(consumed_energy)
[2017-06-09T10:00:05.743] 0(as_mysql_rollup.c:1113) query insert into "cluster_assoc_usage_hour_table" (creation_time, mod_time, id_assoc, time_start, alloc_cpu_secs, consumed_energy) values (...) on duplicate key update mod_time=1497027601, alloc_cpu_secs=VALUES(alloc_cpu_secs), consumed_energy=VALUES(consumed_energy);
[2017-06-09T10:00:05.745] 0(as_mysql_usage.c:376) query update "cluster_last_ran_table" set hourly_rollup=1497027600

[tcooper@cluster ~]# show_last_ran_table.sh
+---------------------+---------------------+---------------------+
| hourly_rollup       | daily_rollup        | monthly_rollup      |
+---------------------+---------------------+---------------------+
| 2017-06-09 10:00:00 | 2017-06-09 09:59:59 | 2017-06-09 09:59:59 |
+---------------------+---------------------+---------------------+

Thanks,

Trevor

> On Jun 9, 2017, at 9:42 AM, Cooper, Trevor <[email protected]> wrote:
>
> After discovering issues with sreport and tracing them back to Slurm DB orphans
> (running AND pending), and then working to resolve that problem by clearing the
> DB of running and pending orphans, I find that rollups are still not running
> automatically.
>
> For example, I triggered a re-rollup this morning by modifying the entries in
> the last_ran_table and restarting slurmdbd, and rollups were redone from April
> 1, 2016 to today at 06:00 AM.
>
> [tcooper@cluster ~]# show_last_ran_table.sh
> +---------------------+---------------------+---------------------+
> | hourly_rollup       | daily_rollup        | monthly_rollup      |
> +---------------------+---------------------+---------------------+
> | 2017-06-09 06:00:00 | 2017-06-09 00:00:00 | 2017-06-01 00:00:00 |
> +---------------------+---------------------+---------------------+
>
> Three hours later, the hourly rollups for [07-09]:00 AM had not run.
> Restarting slurmdbd triggers the rollups, which then run successfully...
>
> [tcooper@cluster ~]# service slurmdbd restart
> stopping slurmdbd: [ OK ]
> slurmdbd (pid 30526) is running...
> slurmdbd (pid 30526) is running...
> slurmdbd (pid 30526) is running...
> slurmdbd (pid 30526) is running...
> starting slurmdbd: [ OK ]
>
> [tcooper@cluster ~]# tailf /var/log/slurm/slurmdbd.log | egrep -v "post user"
> ...
> [2017-06-09T09:22:16.553] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:513563 IP:10.21.2.3 CONN:9
> [2017-06-09T09:22:16.644] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:513563 IP:10.21.2.3 CONN:8
> [2017-06-09T09:24:07.700] Terminate signal (SIGINT or SIGTERM) received
> [2017-06-09T09:24:18.077] debug: auth plugin for Munge (http://code.google.com/p/munge/) loaded
> [2017-06-09T09:24:19.190] Accounting storage MYSQL plugin loaded
> [2017-06-09T09:24:28.114] slurmdbd version 14.11.11 started
> [2017-06-09T09:24:28.115] 0(as_mysql_rollup.c:622) cluster curr hour is now 1497013200-1497016800
> ...
> [2017-06-09T09:24:41.831] Warning: Note very large processing time from hourly_rollup for cluster: usec=13715990 began=09:24:28.115
> [2017-06-09T09:24:41.831] 0(as_mysql_usage.c:376) query update "cluster_last_ran_table" set hourly_rollup=1497024000
> [2017-06-09T09:25:16.746] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:513563 IP:10.21.2.3 CONN:9
> [2017-06-09T09:25:16.839] debug: DBD_INIT: CLUSTER:cluster VERSION:7168 UID:513563 IP:10.21.2.3 CONN:8
> ...
>
> [tcooper@cluster ~]# show_last_ran_table.sh
> +---------------------+---------------------+---------------------+
> | hourly_rollup       | daily_rollup        | monthly_rollup      |
> +---------------------+---------------------+---------------------+
> | 2017-06-09 09:00:00 | 2017-06-09 00:00:00 | 2017-06-01 00:00:00 |
> +---------------------+---------------------+---------------------+
>
> Before our DB orphan issue, generation of rollups was NOT a problem.
>
> Can anyone provide insight into where the 'trigger' for hourly rollups is
> supposed to come from, and/or whether this is a bug in Slurm 14.11.11 that is
> fixed in a later version?
>
> Thanks,
>
> Trevor Cooper
> HPC Systems Programmer
> San Diego Supercomputer Center, UCSD
> 9500 Gilman Drive, 0505
> La Jolla, CA 92093-0505
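For anyone decoding the log lines above: the window in "cluster curr hour is now 1497024000-1497027600" is just epoch seconds truncated to the hour, and the value written to last_ran_table (hourly_rollup=1497027600) is the end of the last hour rolled up. A minimal sketch of that arithmetic (the function name is mine, not Slurm's):

```python
import datetime

SECONDS_PER_HOUR = 3600

def hourly_rollup_window(ts):
    """Return (start, end) epoch seconds of the rollup hour containing ts."""
    start = ts - (ts % SECONDS_PER_HOUR)
    return start, start + SECONDS_PER_HOUR

# The window from the 10:00 AM (Pacific) rollup in the log above:
start, end = hourly_rollup_window(1497025919)
print(start, end)  # 1497024000 1497027600
print(datetime.datetime.fromtimestamp(start, datetime.timezone.utc))
# 2017-06-09 16:00:00+00:00  (i.e. 09:00 PDT, matching the log)
```

This also explains why the last_ran_table entries above land exactly on hour boundaries after a successful rollup.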
