Hello all,
Recently we have been getting errors in the Slurm control daemon log
file (slurmctld.log), for example:
[2016-08-30T12:24:01.280] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused
Checking past mails on slurm-dev, I found out that the slurmdbd service
had actually stopped because of old jobs stuck in the database that are
no longer running on the cluster (most likely due to a power outage that
forcefully stopped the whole cluster). After I restarted slurmdbd,
errors showed up in its log (slurmdbd.log), for example:
[2016-08-30T15:00:00.634] error: We have more allocated time than is possible (28616400 > 403200) for cluster ung(112) from 2016-08-30T14:00:00 - 2016-08-30T15:00:00 tres 1
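To make sense of the two numbers in that message, here is a small
sketch of the arithmetic. It assumes (my reading, not something I have
confirmed) that the 112 in "ung(112)" is the cluster's CPU count and
that both figures are CPU-seconds for the one-hour window:

```python
SECONDS_PER_HOUR = 3600

cpus = 112             # assumption: the "(112)" in the log is the CPU count
allocated = 28616400   # left-hand figure from the log message
possible_logged = 403200  # right-hand figure from the log message

# CPU-seconds the cluster could deliver in one hour under that assumption.
possible = cpus * SECONDS_PER_HOUR

# Number of CPUs that would have to be busy for the whole hour to
# account for the allocated figure.
equivalent_cpus = allocated / SECONDS_PER_HOUR

print(possible == possible_logged)  # True
print(equivalent_cpus)              # 7949.0
```

If that reading is right, the database believes far more CPUs were in
use during that hour than the cluster has, which fits with stale jobs
still being counted as running.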
Running sacct to determine the jobs that are producing this error, I get
the following:
JobID        State      Start               End                 Elapsed    ExitCode
------------ ---------- ------------------- ------------------- ---------- --------
175291       RUNNING    2016-04-17T19:59:13 Unknown             135-17:20:17 0:0
180943       RUNNING    2016-08-31T09:21:10 Unknown             03:58:53   0:0
The first job (175291) is an example of a job that is causing problems,
while the second (180943) is an example of a correctly running job.
What I would like to find out is a way to delete the problematic jobs
from the database, give them an end time, or otherwise stop them from
causing problems on the cluster. The only solution I have found so far
is this advice: "The solution is to manually modify the job record in
mysql to change its state from 1 to 3 and update time_end to something
reasonable." Since I am not a MySQL expert, I could use some
instructions on how to do this.
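My untested guess at what the quoted advice amounts to is sketched
below. It runs against a throwaway in-memory SQLite table, not the real
MySQL database. The column names (id_job, state, time_start, time_end),
the convention that time_end is 0 for an unfinished job, and the helper
close_stuck_job are all assumptions of mine; the state values 1 and 3
come from the quoted advice, and the timestamps are made up:

```python
import sqlite3

STATE_RUNNING = 1   # "from 1" in the quoted advice
STATE_COMPLETE = 3  # "to 3" in the quoted advice


def close_stuck_job(conn, table, job_id, end_time):
    """Mark one job that is still recorded as running as complete.

    Hypothetical helper. The table name cannot be a bound parameter,
    so it must come from a trusted, hard-coded value.
    """
    sql = (
        f"UPDATE {table} SET state = ?, time_end = ? "
        "WHERE id_job = ? AND state = ? AND time_end = 0"
    )
    cur = conn.execute(sql, (STATE_COMPLETE, end_time, job_id, STATE_RUNNING))
    conn.commit()
    return cur.rowcount  # number of rows actually changed


# Throwaway stand-in for ung_job_table with only the assumed columns.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE ung_job_table "
    "(id_job INTEGER, state INTEGER, time_start INTEGER, time_end INTEGER)"
)
conn.executemany(
    "INSERT INTO ung_job_table VALUES (?, ?, ?, ?)",
    [
        (175291, STATE_RUNNING, 1460000000, 0),  # stuck job, made-up start
        (180943, STATE_RUNNING, 1470000000, 0),  # healthy job, left alone
    ],
)

# Give the stuck job an end time one hour after its (made-up) start.
changed = close_stuck_job(conn, "ung_job_table", 175291, 1460000000 + 3600)
rows = conn.execute(
    "SELECT id_job, state, time_end FROM ung_job_table ORDER BY id_job"
).fetchall()
print(changed, rows)
```

The WHERE clause is deliberately narrow (one job ID, still marked as
running, no end time yet) so that a job which is really running is not
touched. On the real database I would presumably want to take a dump
first and check the actual column names before running anything like
this.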
Some relevant information about our cluster, if needed:
- AccountingStorageType=accounting_storage/slurmdbd
- Slurm version: 15.08.1
- Tables in the MySQL slurm_db database (assuming this is where the fix
  has to be made):
+-----------------------------+
| Tables_in_slurm_db |
+-----------------------------+
| acct_coord_table |
| acct_table |
| clus_res_table |
| cluster_table |
| qos_table |
| res_table |
| table_defs_table |
| tres_table |
| txn_table |
| ung_assoc_table |
| ung_assoc_usage_day_table |
| ung_assoc_usage_hour_table |
| ung_assoc_usage_month_table |
| ung_event_table |
| ung_job_table |
| ung_last_ran_table |
| ung_resv_table |
| ung_step_table |
| ung_suspend_table |
| ung_usage_day_table |
| ung_usage_hour_table |
| ung_usage_month_table |
| ung_wckey_table |
| ung_wckey_usage_day_table |
| ung_wckey_usage_hour_table |
| ung_wckey_usage_month_table |
| user_table |
+-----------------------------+
27 rows in set (0.00 sec)
Cheers,
Gasper Kukec Mezek