So we are running slurm-2.5.7 on our cluster with a master controller and a backup.
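For context, the failover-related part of our slurm.conf looks roughly like this (the hostnames and path below are placeholders, not our actual values):

# slurm.conf failover excerpt -- hostnames/path are placeholders
ControlMachine=ctl-primary              # primary slurmctld host
BackupController=ctl-backup             # backup slurmctld host
StateSaveLocation=/shared/slurm/state   # on a filesystem both controllers mount
SlurmctldTimeout=120                    # seconds before the backup assumes control

As I understand it, the backup waits SlurmctldTimeout seconds without hearing from the primary and then takes over, reading the running-job state from StateSaveLocation.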
This morning our primary suffered from this error:

[2013-07-04T11:39:18-04:00] sched: job_complete for JobId=103965 successful
[2013-07-04T11:39:18-04:00] Node holy2b09207 now responding
[2013-07-04T11:39:18-04:00] sched: Allocate JobId=106801 NodeList=holy2b09203 #CPUs=4
[2013-07-04T11:39:18-04:00] sched: Allocate JobId=106802 NodeList=holy2b07208 #CPUs=4
[2013-07-04T11:39:18-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe

Once that happened, it just kept spouting that error and became unresponsive, at which point the backup took control. However, once I killed the primary via kill -9 (taking the service down cleanly did not work), the backup discovered jobs that were not registered with it and terminated several valid jobs that were still running. Lost jobs are unacceptable in any situation, and it's distressing that we lost them even with a backup in place. Does anyone know what this error means and how we can avoid it in the future?

Essentially, with the primary stuck on that error, the cluster went into split-brain mode. I thought the state files on the shared filesystem were supposed to prevent this, since all the currently running jobs are recorded there. Or did I misunderstand, and are those files only updated when the primary goes down? I would like to understand why we lost jobs so we can prevent it from happening again.

-Paul Edmon-
