Hello all,

I am running a test cluster with Mesos and Marathon on 20 compute nodes and 2 head nodes; the head nodes run VMs that host all masters, frameworks, etc. Until the 0.23 update there were not many issues, but today I saw an issue that I want to share, in the hope that you know more about it.

We run the updated Mesos 0.23 and Marathon 0.10.0.

I started an HDFS namenode in Docker through Marathon and a couple of datanodes on the agents. I'm slowly building this config out with secondary namenodes, datanodes and journal nodes, all in containers. For now it's a very basic setup to see how stable everything is and what we should consider when running in containers.

Today we found out that the Marathon leader was suddenly registered twice as a framework, with different IDs. To make it worse, it spawned a task again that was already running. Suddenly we had 2 namenodes with the name management. Our Consul cluster auto-registered both containers and started to forward all traffic to these 2 namenodes.

I always thought that ZooKeeper took care of leader election for Marathon and that this should prevent scenarios like this. However, the two frameworks had different IDs, which would explain why ZooKeeper's election did not prevent it.
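For anyone who wants to check for the same situation, here is a minimal sketch of how the duplicate registration could be detected. It assumes the JSON returned by the leading master's state endpoint (`/master/state.json` on 0.23) contains a `frameworks` list whose entries have `id` and `name` fields; the function name and the sample IDs are made up for illustration.

```python
from collections import defaultdict


def duplicate_frameworks(state):
    """Return {framework name: [ids]} for every name registered under more than one ID.

    `state` is the parsed JSON from the Mesos master state endpoint
    (assumed layout: {"frameworks": [{"id": ..., "name": ...}, ...]}).
    """
    ids_by_name = defaultdict(set)
    for fw in state.get("frameworks", []):
        ids_by_name[fw.get("name", "")].add(fw.get("id", ""))
    # Keep only names that appear with at least two distinct framework IDs.
    return {name: sorted(ids) for name, ids in ids_by_name.items() if len(ids) > 1}


# Hypothetical sample data, not taken from our cluster.
sample_state = {
    "frameworks": [
        {"id": "fw-0001", "name": "marathon"},
        {"id": "fw-0002", "name": "marathon"},
        {"id": "fw-0003", "name": "chronos"},
    ]
}

for name, ids in duplicate_frameworks(sample_state).items():
    print(name, ids)
```

In practice the `state` argument would come from fetching the state endpoint of the leading master and parsing it with `json.load`; run periodically, this would flag the moment a second Marathon framework ID appears.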

The Marathon web interface was no longer responding and everything timed out. I found that only a single Marathon process was running. To get HDFS running again I killed the containers and killed the Marathon process. From the logs I couldn't work out why this happened: in the 10 minutes around the registration of the framework there is nothing but offers, HTTP calls and task syncs.

The strange thing I just noticed is that Marathon occasionally re-registers itself even though its process has not been restarted and no new leader has been elected.

Does anyone have an idea where to look?

--
Rogier Dikkes
Systeem Programmeur Hadoop & HPC Cloud
SURFsara | Science Park 140 | 1098 XG Amsterdam
