Haven't seen a response on this so I thought I would re-ping. Has anyone ever seen the below error?
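In case it helps anyone spot a misconfiguration, the HA-related parts of our slurm.conf look roughly like the following sketch; the hostnames, paths, and timeout values here are placeholders rather than our exact settings:

    # Illustrative slurm.conf excerpt (hostnames and paths are placeholders)
    ControlMachine=slurm-master1          # primary slurmctld
    BackupController=slurm-master2        # backup slurmctld that took over
    StateSaveLocation=/shared/slurm/state # state directory on a filesystem visible to both controllers
    SlurmctldTimeout=120                  # seconds before the backup assumes the primary is dead
    SelectType=select/cons_res            # plugin that logged the "oversubscribe" error below
    SelectTypeParameters=CR_Core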
-Paul Edmon-

On 07/04/2013 01:36 PM, Paul Edmon wrote:
> So we are running slurm-2.5.7 on our cluster with a master and a backup.
> This morning our primary suffered from this error:
>
> [2013-07-04T11:39:18-04:00] sched: job_complete for JobId=103965 successful
> [2013-07-04T11:39:18-04:00] Node holy2b09207 now responding
> [2013-07-04T11:39:18-04:00] sched: Allocate JobId=106801 NodeList=holy2b09203 #CPUs=4
> [2013-07-04T11:39:18-04:00] sched: Allocate JobId=106802 NodeList=holy2b07208 #CPUs=4
> [2013-07-04T11:39:18-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
>
> When that happened, it continued to spout that error and became
> unresponsive, at which point the backup took control. However, once I
> killed the failing master via kill -9 (taking the service down cleanly
> did not work), the backup discovered jobs that were not registered and
> terminated several valid jobs that were still running. Naturally, lost
> jobs are unacceptable in any situation, and it's distressing that we
> lost jobs even with a backup.
>
> Does anyone know what this error means and how we can avoid it in the
> future? Essentially, with one master down with the error, the cluster
> went into a split-brain mode. I thought the state files on the shared
> filesystem were supposed to prevent this, since all currently running
> jobs are recorded there. Or did I misunderstand, and those files are
> only updated when the master goes down? I would like to understand why
> we lost jobs so we can prevent it from happening again.
>
> -Paul Edmon-
