Hello all,

Recently we have been getting errors in the Slurm control daemon log file (slurmctld.log), for example:

[2016-08-30T12:24:01.280] error: slurmdbd: DBD_SEND_MULT_JOB_START failure: Connection refused

Checking past mails on slurm-dev, I found out that the slurmdbd service had actually stopped because of old jobs stuck in the database that are no longer running on the cluster (most likely left over from a power outage that forcefully stopped the whole cluster). After restarting slurmdbd, the errors showed up in its log (slurmdbd.log), for example:

[2016-08-30T15:00:00.634] error: We have more allocated time than is possible (28616400 > 403200) for cluster ung(112) from 2016-08-30T14:00:00 - 2016-08-30T15:00:00 tres 1

Running sacct to determine the jobs that are producing this error, I get the following:

       JobID      State               Start                 End    Elapsed ExitCode
------------ ---------- ------------------- ------------------- ---------- --------
175291          RUNNING 2016-04-17T19:59:13 Unknown 135-17:20:17      0:0
180943          RUNNING 2016-08-31T09:21:10 Unknown   03:58:53      0:0

The first job (175291) is an example of a job that is causing problems, and the second (180943) is an example of a correctly running job.
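(For reference, the listing above came from a command roughly like the one below; I am reconstructing it from memory, so the exact options may differ.)

   sacct --allusers --state=RUNNING --starttime=2016-01-01 \
         --format=JobID,State,Start,End,Elapsed,ExitCode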

What I would like to find is a way to delete the problematic jobs from the database, either by giving them an end time or otherwise stopping them from causing problems on the cluster. The only solution I have found so far is "The solution is to manually modify the job record in mysql to change its state from 1 to 3 and update time_end to something reasonable.", but since I am not a mysql expert, I could use some instructions on how to do this.
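For what it's worth, my own guess at what that would mean (assuming the job record lives in ung_job_table, is keyed by an id_job column, has state 1 = running and 3 = completed, and stores time_end as a Unix timestamp) is something along these lines:

   -- Guesswork, not a known-good recipe: assumes ung_job_table has columns
   -- id_job, state and time_end, with state 1 = running, 3 = completed,
   -- and time_end stored as a Unix timestamp.
   UPDATE ung_job_table
      SET state = 3,
          time_end = UNIX_TIMESTAMP('2016-08-30 12:00:00')
    WHERE id_job = 175291
      AND state = 1;

Could someone confirm whether that is along the right lines, or correct me where it is wrong?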

Some info about our cluster that may be relevant:
- AccountingStorageType=accounting_storage/slurmdbd
- Slurm version: slurm 15.08.1
- Tables in our mysql slurm_db database (assuming this is where the fix needs to be made):
   +-----------------------------+
   | Tables_in_slurm_db          |
   +-----------------------------+
   | acct_coord_table            |
   | acct_table                  |
   | clus_res_table              |
   | cluster_table               |
   | qos_table                   |
   | res_table                   |
   | table_defs_table            |
   | tres_table                  |
   | txn_table                   |
   | ung_assoc_table             |
   | ung_assoc_usage_day_table   |
   | ung_assoc_usage_hour_table  |
   | ung_assoc_usage_month_table |
   | ung_event_table             |
   | ung_job_table               |
   | ung_last_ran_table          |
   | ung_resv_table              |
   | ung_step_table              |
   | ung_suspend_table           |
   | ung_usage_day_table         |
   | ung_usage_hour_table        |
   | ung_usage_month_table       |
   | ung_wckey_table             |
   | ung_wckey_usage_day_table   |
   | ung_wckey_usage_hour_table  |
   | ung_wckey_usage_month_table |
   | user_table                  |
   +-----------------------------+
   27 rows in set (0.00 sec)

Cheers,
Gasper Kukec Mezek
