[slurm-dev] Re: Fixing corrupted slurm accounting?

2017-10-28 Thread Douglas Jacobsen
A more complete response would be something like:

MariaDB [slurm_acct_db]> select * from _last_ran_table;
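+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1509206400 |   1509174000 |     1506841200 |
+---------------+--------------+----------------+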
1 row in set (0.00 sec)

MariaDB [slurm_acct_db]> update _last_ran_table set
    hourly_rollup=UNIX_TIMESTAMP('2017-01-01 00:00:00'),
    daily_rollup=UNIX_TIMESTAMP('2017-01-01 00:00:00'),
    monthly_rollup=UNIX_TIMESTAMP('2017-01-01 00:00:00');
Query OK, 1 row affected (0.05 sec)
Rows matched: 1  Changed: 1  Warnings: 0

MariaDB [alva_slurm_acct_db]> select * from _last_ran_table;
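+---------------+--------------+----------------+
| hourly_rollup | daily_rollup | monthly_rollup |
+---------------+--------------+----------------+
|    1483257600 |   1483257600 |     1483257600 |
+---------------+--------------+----------------+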
1 row in set (0.01 sec)

MariaDB [slurm_acct_db]> quit

Making changes to the timestamps and "" as appropriate.

Obviously mucking with the database is dangerous, so be careful.
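
As a sanity check before and after touching the table, it can help to decode
the epoch values into readable times. The snippet below is only a sketch: the
four values are copied from the session above, the helper name describe() is
made up for illustration, and everything is printed in UTC because the
database server's time zone is not stated in this thread.

```python
from datetime import datetime, timezone


def describe(epoch):
    """Format a Unix timestamp as a readable UTC string (hypothetical helper)."""
    return datetime.fromtimestamp(epoch, tz=timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC"
    )


# Values copied from the two SELECT outputs in the session above.
values = {
    "hourly_rollup (before)": 1509206400,
    "daily_rollup (before)": 1509174000,
    "monthly_rollup (before)": 1506841200,
    "all three (after update)": 1483257600,
}

for label, epoch in values.items():
    print(f"{label:26s} {epoch}  {describe(epoch)}")
```

The value after the update decodes to 2017-01-01 08:00:00 UTC, which is what
UNIX_TIMESTAMP('2017-01-01 00:00:00') would give on a server whose session
time zone is eight hours behind UTC. On a server in another time zone the
same statement stores a different number.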


Doug Jacobsen, Ph.D.
NERSC Computer Systems Engineer
National Energy Research Scientific Computing Center 
dmjacob...@lbl.gov

- __o
-- _ '\<,_
--(_)/  (_)__


On Sat, Oct 28, 2017 at 9:17 AM, Douglas Jacobsen 
wrote:

> Once you've got the end times fixed, you'll need to manually update the
> timestamps in the _last_ran table to some time point before the
> start of the earliest job fixed.  Then on the next hour mark, it'll start
> rerolling up the past data to reflect the new reality you've set in the
> database.
>
> Unfortunately I'm away from a keyboard right now so I'm not 100% certain
> of the table name.
>
> On Oct 28, 2017 09:09, "Doug Meyer"  wrote:
>
>> Look up orphan jobs and lost.pl (quick script to find orphans) in
>> https://groups.google.com/forum/#!forum/slurm-devel.
>>
>> Battling this myself right now.
>>
>> Thank you,
>> Doug
>>
>> On Fri, Oct 27, 2017 at 9:00 PM, Bill Broadley 
>> wrote:
>>
>>>
>>>
>>> I noticed crazy high numbers in my reports, things like sreport user top:
>>> Top 10 Users 2017-10-20T00:00:00 - 2017-10-26T23:59:59 (604800 secs)
>>> Use reported in Percentage of Total
>>>
>>>   Cluster     Login     Proper Name   Account       Used   Energy
>>> --------- --------- --------------- --------- ---------- --------
>>>   MyClust   JoeUser        Joe User      jgrp   3710.15%    0.00%
>>>
>>> This was during a period when JoeUser hadn't submitted a single job.
>>>
>>> We have been through some Slurm upgrades, so I figured one of the schema
>>> tweaks had confused things.  I looked in the Slurm accounting database and
>>> found the job_table.  I found 80,000 jobs with no end_time that weren't
>>> actually running, so I set end_time = begin time for those 80,000 jobs.
>>> It didn't help the reports.
>>>
>>> I then tried deleting all 80,000 jobs from the job_table and that didn't
>>> help either.
>>>
>>> Is there a way to rebuild the accounting data from the information in
>>> the job_table?
>>>
>>> Or any other suggestion for getting some sane numbers out?
>>>
>>
>>


[slurm-dev] Re: Fixing corrupted slurm accounting?

2017-10-28 Thread Douglas Jacobsen
Once you've got the end times fixed, you'll need to manually update the
timestamps in the _last_ran table to some time point before the
start of the earliest job fixed.  Then on the next hour mark, it'll start
rerolling up the past data to reflect the new reality you've set in the
database.
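
A rough way to pick that "time point before the start of the earliest job
fixed" is sketched below. This is an illustration only: the function name
rollup_reset_point() and the sample start times are made up, and stepping
back to an hour boundary plus one extra hour is an assumption chosen to stay
strictly before the earliest job, not something Slurm is known to require.

```python
def rollup_reset_point(start_times):
    """Return an epoch value on an hour boundary, strictly before the
    earliest start time in start_times (hypothetical helper)."""
    if not start_times:
        raise ValueError("need at least one job start time")
    earliest = min(start_times)
    # Round down to the top of the hour, then step back one more hour so the
    # result is earlier than the job even if it started exactly on the hour.
    return (earliest // 3600) * 3600 - 3600


# Made-up start times (Unix epoch seconds) for the jobs whose end times
# were repaired.
fixed_job_starts = [1509206450, 1483257720, 1500000000]

print(rollup_reset_point(fixed_job_starts))
```

The result is a value that could be written into the three rollup columns.
Any earlier time works too; it only means more history gets rolled up again.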

Unfortunately I'm away from a keyboard right now so I'm not 100% certain of
the table name.



[slurm-dev] Re: Fixing corrupted slurm accounting?

2017-10-28 Thread Doug Meyer
Look up orphan jobs and lost.pl (quick script to find orphans) in
https://groups.google.com/forum/#!forum/slurm-devel.

Battling this myself right now.

Thank you,
Doug
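
For readers who have not seen lost.pl, the idea of an "orphan" as described
in this thread (an accounting record with no end time for a job that is not
actually running) can be sketched as below. This is not lost.pl itself: the
function name, the dictionary keys, and the sample records are all
hypothetical, and a record with time_end equal to 0 is assumed to mean "no
end time recorded".

```python
def find_orphans(records, running_ids):
    """Return sorted job IDs that have no recorded end time but are not
    among the jobs currently running (hypothetical helper)."""
    return sorted(
        rec["id_job"]
        for rec in records
        if rec["time_end"] == 0 and rec["id_job"] not in running_ids
    )


# Made-up accounting records; time_end == 0 stands for "no end time".
records = [
    {"id_job": 101, "time_start": 1508000000, "time_end": 1508003600},
    {"id_job": 102, "time_start": 1508100000, "time_end": 0},
    {"id_job": 103, "time_start": 1509200000, "time_end": 0},
]

# Made-up set of job IDs the scheduler reports as still running.
running_ids = {103}

print(find_orphans(records, running_ids))
```

Job 103 has no end time either, but it is still running, so only job 102 is
reported.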
