I haven't seen a response on this, so I thought I would re-ping. Has anyone
ever seen the error below?

-Paul Edmon-

On 07/04/2013 01:36 PM, Paul Edmon wrote:
> So we are running slurm-2.5.7 on our cluster with a master and a
> backup.  This morning our primary suffered from this error:
>
> [2013-07-04T11:39:18-04:00] sched: job_complete for JobId=103965 successful
> [2013-07-04T11:39:18-04:00] Node holy2b09207 now responding
> [2013-07-04T11:39:18-04:00] sched: Allocate JobId=106801 NodeList=holy2b09203 #CPUs=4
> [2013-07-04T11:39:18-04:00] sched: Allocate JobId=106802 NodeList=holy2b07208 #CPUs=4
> [2013-07-04T11:39:18-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe
>
> When that happened, it continued to spout that error and was
> unresponsive, at which point the backup took control.  However, I had
> to kill the primary via kill -9, since taking the service down cleanly
> did not work.  Once I did, the backup discovered jobs it did not have
> registered and terminated several valid jobs that were still running.
> Naturally, lost jobs are unacceptable in any situation, and it's
> distressing that we lost jobs even with a backup.
>
> Does anyone know what this error means and how we can avoid it in the
> future?  Essentially, with one master down with the error, the cluster
> went into a split-brain mode.  I thought the log files on the shared
> filesystem were supposed to prevent this, since all the currently
> running jobs are written there.  Or did I misunderstand, and those
> files are only updated when the master goes down?  I would like to
> understand why we lost jobs so we can prevent it from happening again.
>
> -Paul Edmon-
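
For reference, since the question hinges on how the primary and backup
controllers share state: below is a rough sketch of the HA-related
slurm.conf settings involved. The hostnames, path, and timeout are
placeholders, not our actual values.

    # slurm.conf (excerpt) -- sketch only; values are placeholders
    ControlMachine=slurm-master1           # primary slurmctld host
    BackupController=slurm-master2         # backup slurmctld that takes over on failure
    StateSaveLocation=/shared/slurm/state  # job/node state files; must be on a
                                           # filesystem mounted by both controllers
    SlurmctldTimeout=120                   # seconds the backup waits before assuming control
    # "scontrol ping" reports which controller is currently responding.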
