Hi,
I'm trying to put together some reports using the slurm accounting database.
For some reason, some jobs are showing up with an association id of 0
even though there is a valid user id in the job table. Consequently, the
hour, day and monthly summary tables are not showing the right amount of
cpu seconds against each user. Due to issues with missing associations
I'm looking at generating reports directly from a join between
<cluster>_job_table and <cluster>_step_table but the numbers are not
adding up.
MariaDB [slurm_acct_db]> select
sum(sys_sec+sys_usec+user_sec+user_usec),id_user, id_assoc
-> from umic_step_table s ,umic_job_table j
-> where j.job_db_inx=s.job_db_inx
-> group by id_user,id_assoc
-> order by 1 desc;
+------------------------------------------+---------+----------+
| sum(sys_sec+sys_usec+user_sec+user_usec) | id_user | id_assoc |
+------------------------------------------+---------+----------+
| 6493170358 | 20006 | 9 |
| 9857479 | 999 | 0 |
| 4407923 | 20021 | 0 |
| 4372063 | 20022 | 0 |
| 3580520 | 20019 | 0 |
| 3482616 | 20020 | 0 |
| 803045 | 20013 | 10 |
| 103101 | 20013 | 8 |
| 41168 | 20018 | 19 |
| 26423 | 20017 | 18 |
| 24744 | 20026 | 22 |
| 0 | 20022 | 21 |
+------------------------------------------+---------+----------+
The user with the most cpu time has a valid id_assoc for every job in
the job table. I also know for a fact that this user stopped running
jobs a while ago so all the above jobs should have made it into the
summary (hour/day/month) summary tables.
If I do a sum over the <cluster>_assoc_usage_hour_table, I don't get the
same answer as above for this user, and the sum for the id_assoc above
does not equal the value in the NULL:NULL column below...
MariaDB [slurm_acct_db]> select a.acct, a.user,sum(u.alloc_secs)
-> from <cluster>_assoc_usage_hour_table as u
-> left join <cluster>_assoc_table a
-> on a.id_assoc=u.id
-> where u.id_tres=1
-> group by a.acct, a.user
-> order by 3 desc ;
+---------+----------+-------------------+
| acct | user | sum(u.alloc_secs) |
+---------+----------+-------------------+
| acct1 | user1 | 5784629904 |
| NULL | NULL | 301693 |
| acct2 | user2 | 83248 |
| acct3 | user3 | 1167 |
| acct4 | user4 | 20 |
| acct5 | user5 | 3 |
+---------+----------+-------------------+
Am I doing something wrong here or is it something I'm missing. What do
sys_usec and user_usec represent ? Do the summary tables look at
walltime for each host rather than the cpu seconds used in the step
table ? Can anyone point me to detailed documentation on how the
accounting DB schema hangs together ?
Many thanks,
Emyr
--
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.