Marathon split brain situation

Rogier Dikkes Fri, 28 Aug 2015 07:06:30 -0700

Hello all,

I am running a test cluster with Mesos and Marathon, consisting of 20 compute nodes and 2 head nodes running VMs that host all masters, frameworks, etc. Until the 0.23 update there were not many issues, but today I saw an issue that I must share, and I hope you know more about it.


We run an updated Mesos version 0.23 and Marathon 0.10.0.

I started an HDFS namenode in Docker through Marathon and a couple of datanodes on the agents. I'm slowly building this config further with secondary namenodes, datanodes and journal nodes, all in containers. For now it's a very basic setup to see how stable everything is and what we should consider when running in containers.

Today we found out that the Marathon leader was suddenly registered twice as a framework, with different IDs, and to make it worse, it spawned a task again that was already running. Suddenly we had 2 namenodes with the name management. Our Consul cluster auto-registered both containers and started to forward all traffic to these 2 namenodes.
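One way to confirm a double registration like this is to look at the framework list in the Mesos master's state output. The following is only a minimal sketch: it assumes the state JSON has a top-level "frameworks" list whose entries carry "id" and "name" fields, and the sample IDs and the helper name duplicate_frameworks are hypothetical, for illustration only.

```python
from collections import defaultdict


def duplicate_frameworks(state):
    """Return {framework name: [framework ids]} for every name that is
    registered more than once in the given master state dict.

    Assumes `state` is the parsed JSON of the Mesos master state, with a
    top-level "frameworks" list of objects that have "id" and "name".
    """
    ids_by_name = defaultdict(list)
    for framework in state.get("frameworks", []):
        ids_by_name[framework.get("name")].append(framework.get("id"))
    # Keep only the names that appear with more than one framework id.
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical sample resembling the situation described above.
sample_state = {
    "frameworks": [
        {"id": "fw-0001", "name": "marathon"},
        {"id": "fw-0002", "name": "marathon"},
        {"id": "fw-0003", "name": "chronos"},
    ]
}

if __name__ == "__main__":
    print(duplicate_frameworks(sample_state))
```

In practice the dict would come from the master's state endpoint (on a 0.23 master presumably something like /master/state.json on port 5050, which is an assumption about the setup) loaded with json.load.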

I always thought that ZooKeeper took care of leader election for Marathon and that this should prevent scenarios like this. However, both frameworks had a different ID, which would explain why ZooKeeper didn't handle the election.

The Marathon web interface was no longer responding and everything timed out. I found out that only a single Marathon process was running. To get HDFS running again I killed the containers and killed the Marathon process. From the logs I couldn't gather why this happened: in the 10 minutes around the registration of the framework there is nothing but offers, HTTP calls and task syncs.

The strange thing I just noticed is that Marathon occasionally re-registers itself even though its process has not been restarted or re-elected.


Does anyone have an idea where to look?

--
Rogier Dikkes
Systeem Programmeur Hadoop & HPC Cloud
SURFsara | Science Park 140 | 1098 XG Amsterdam
