Hi,
I'm trying to put together some reports using the slurm accounting database.
For some reason, some jobs are showing up with an association id of 0 even though there is a valid user id in the job table. Consequently, the hour, day and monthly summary tables are not showing the right amount of cpu seconds against each user. Due to issues with missing associations I'm looking at generating reports directly from a join between <cluster>_job_table and <cluster>_step_table but the numbers are not adding up.

MariaDB [slurm_acct_db]> select sum(sys_sec+sys_usec+user_sec+user_usec),id_user, id_assoc
    -> from umic_step_table s ,umic_job_table j
    -> where j.job_db_inx=s.job_db_inx
    -> group by id_user,id_assoc
    -> order by 1 desc;
+------------------------------------------+---------+----------+
| sum(sys_sec+sys_usec+user_sec+user_usec) | id_user | id_assoc |
+------------------------------------------+---------+----------+
|                               6493170358 |   20006 |        9 |
|                                  9857479 |     999 |        0 |
|                                  4407923 |   20021 |        0 |
|                                  4372063 |   20022 |        0 |
|                                  3580520 |   20019 |        0 |
|                                  3482616 |   20020 |        0 |
|                                   803045 |   20013 |       10 |
|                                   103101 |   20013 |        8 |
|                                    41168 |   20018 |       19 |
|                                    26423 |   20017 |       18 |
|                                    24744 |   20026 |       22 |
|                                        0 |   20022 |       21 |
+------------------------------------------+---------+----------+

The user with the most cpu time has a valid id_assoc for every job in the job table. I also know for a fact that this user stopped running jobs a while ago so all the above jobs should have made it into the summary (hour/day/month) summary tables.

If I do a sum over the <cluster>_assoc_usage_hour_table, I don't get the same answer as above for this user, and the sum for the id_assoc above does not equal the value in the NULL:NULL column below...

MariaDB [slurm_acct_db]> select  a.acct, a.user,sum(u.alloc_secs)
    -> from <cluster>_assoc_usage_hour_table as u
    -> left join <cluster>_assoc_table a
    -> on a.id_assoc=u.id
    -> where u.id_tres=1
    -> group by a.acct, a.user
    -> order by 3 desc ;
+---------+----------+-------------------+
| acct    | user     | sum(u.alloc_secs) |
+---------+----------+-------------------+
| acct1   | user1    |        5784629904 |
| NULL    | NULL     |            301693 |
| acct2   | user2    |             83248 |
| acct3   | user3    |              1167 |
| acct4   | user4    |                20 |
| acct5   | user5    |                 3 |
+---------+----------+-------------------+

Am I doing something wrong here or is it something I'm missing. What do sys_usec and user_usec represent ? Do the summary tables look at walltime for each host rather than the cpu seconds used in the step table ? Can anyone point me to detailed documentation on how the accounting DB schema hangs together ?

Many thanks,

Emyr



--
The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.

Reply via email to