So we are running slurm-2.5.7 on our cluster with a master and a 
backup.  This morning our primary suffered from this error:

[2013-07-04T11:39:18-04:00] sched: job_complete for JobId=103965 successful
[2013-07-04T11:39:18-04:00] Node holy2b09207 now responding
[2013-07-04T11:39:18-04:00] sched: Allocate JobId=106801 NodeList=holy2b09203 #CPUs=4
[2013-07-04T11:39:18-04:00] sched: Allocate JobId=106802 NodeList=holy2b07208 #CPUs=4
[2013-07-04T11:39:18-04:00] error: cons_res: _compute_c_b_task_dist oversubscribe

When that happened, the primary kept spouting that error and became
unresponsive, at which point the backup took control.  The failing
primary would not shut down cleanly, so I had to kill it with kill -9.
Once I did, the backup discovered jobs it had no record of and
terminated several valid jobs that were still running.  Naturally,
lost jobs are unacceptable in any situation, and it's distressing that
we lost jobs even with a backup in place.
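
For what it's worth, this is how we normally check which controller is
active, via scontrol ping (the hostnames below are placeholders, not
our real controller names):

$ scontrol ping
Slurmctld(primary/backup) at master1/master2 are UP/DOWN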

Does anyone know what this error means and how we can avoid it in the
future?  Essentially, with the primary down with this error, the
cluster went into a split-brain mode.  I thought the state files on
the shared filesystem were supposed to prevent this, since all the
currently running jobs are recorded there.  Or did I misunderstand,
and are those files only updated when the master goes down?  I would
like to understand why we lost jobs so we can prevent it from
happening again.
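
For reference, this is roughly the failover-related part of our
slurm.conf (the hostnames and paths below are placeholders for
illustration, not our actual values):

# slurm.conf (excerpt) - hypothetical hostnames/paths
ControlMachine=master1
BackupController=master2
# Both controllers mount the same StateSaveLocation; the backup reads
# the job state saved here when it takes over.
StateSaveLocation=/shared/slurm/state
# Seconds without a response before the backup assumes control.
SlurmctldTimeout=120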

-Paul Edmon-
