After doing a dump of the database I can see that there are no triggers
set up. I can't seen any slurm related cronjobs on the slurmdb or
slurmdbd machine. How is the data for the
<cluster>_assoc_usage_<hour,day,month>_table generated ? Can someone
point me to a place in the github source relvant to this ? I'd like to
see the query used to populate the table from the data in the job and
step tables.
On 06/07/2017 10:11, Emyr James wrote:
Hi,
I'm trying to put together some reports using the slurm accounting
database.
For some reason, some jobs are showing up with an association id of 0
even though there is a valid user id in the job table. Consequently,
the hour, day and monthly summary tables are not showing the right
amount of cpu seconds against each user. Due to issues with missing
associations I'm looking at generating reports directly from a join
between <cluster>_job_table and <cluster>_step_table but the numbers
are not adding up.
MariaDB [slurm_acct_db]> select
sum(sys_sec+sys_usec+user_sec+user_usec),id_user, id_assoc
-> from umic_step_table s ,umic_job_table j
-> where j.job_db_inx=s.job_db_inx
-> group by id_user,id_assoc
-> order by 1 desc;
+------------------------------------------+---------+----------+
| sum(sys_sec+sys_usec+user_sec+user_usec) | id_user | id_assoc |
+------------------------------------------+---------+----------+
| 6493170358 | 20006 | 9 |
| 9857479 | 999 | 0 |
| 4407923 | 20021 | 0 |
| 4372063 | 20022 | 0 |
| 3580520 | 20019 | 0 |
| 3482616 | 20020 | 0 |
| 803045 | 20013 | 10 |
| 103101 | 20013 | 8 |
| 41168 | 20018 | 19 |
| 26423 | 20017 | 18 |
| 24744 | 20026 | 22 |
| 0 | 20022 | 21 |
+------------------------------------------+---------+----------+
The user with the most cpu time has a valid id_assoc for every job in
the job table. I also know for a fact that this user stopped running
jobs a while ago so all the above jobs should have made it into the
summary (hour/day/month) summary tables.
If I do a sum over the <cluster>_assoc_usage_hour_table, I don't get
the same answer as above for this user, and the sum for the id_assoc
above does not equal the value in the NULL:NULL column below...
MariaDB [slurm_acct_db]> select a.acct, a.user,sum(u.alloc_secs)
-> from <cluster>_assoc_usage_hour_table as u
-> left join <cluster>_assoc_table a
-> on a.id_assoc=u.id
-> where u.id_tres=1
-> group by a.acct, a.user
-> order by 3 desc ;
+---------+----------+-------------------+
| acct | user | sum(u.alloc_secs) |
+---------+----------+-------------------+
| acct1 | user1 | 5784629904 |
| NULL | NULL | 301693 |
| acct2 | user2 | 83248 |
| acct3 | user3 | 1167 |
| acct4 | user4 | 20 |
| acct5 | user5 | 3 |
+---------+----------+-------------------+
Am I doing something wrong here or is it something I'm missing. What
do sys_usec and user_usec represent ? Do the summary tables look at
walltime for each host rather than the cpu seconds used in the step
table ? Can anyone point me to detailed documentation on how the
accounting DB schema hangs together ?
Many thanks,
Emyr
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.