Hello all,
I am running a test cluster with Mesos and Marathon on 20 compute
nodes and 2 head nodes running VMs that host all masters, frameworks
etc. Until the 0.23 update there were not many issues, but today I saw
an issue that I want to share, in the hope that you know more about it.
We run an updated Mesos version 0.23 and Marathon 0.10.0.
I started an HDFS namenode in Docker through Marathon and a couple of
datanodes on the agents. I am slowly building this config out with
secondary namenodes, datanodes and journal nodes, all in containers. For
now it is a very basic setup to see how stable everything is and what we
should consider when running in containers.
Today we found out that the Marathon leader was suddenly registered
twice as a framework, with different IDs, and to make it worse, it
spawned a task again that was already running. Suddenly we had 2
namenodes with the name management. Our Consul cluster auto-registered
both containers and started to forward all traffic to these 2 namenodes.
I always thought that ZooKeeper took care of leader election for
Marathon and that this should prevent scenarios like this. However, both
frameworks had a different ID, which would explain why ZooKeeper didn't
handle the election.
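For anyone who wants to check their own cluster for the same symptom,
here is a minimal sketch of how the duplicate registration could be
spotted from the master's state. It assumes the master serves its state
as JSON at /master/state.json and that the document has a "frameworks"
list whose entries carry "id" and "name" fields; the function names and
the URL argument are my own and only for illustration.

```python
import json
import sys
from collections import defaultdict
from urllib.request import urlopen


def duplicate_frameworks(state):
    """Return {name: [ids]} for names registered under more than one framework ID."""
    ids_by_name = defaultdict(list)
    # Assumption: each entry in "frameworks" has "name" and "id" keys.
    for framework in state.get("frameworks", []):
        ids_by_name[framework.get("name")].append(framework.get("id"))
    return {name: ids for name, ids in ids_by_name.items() if len(ids) > 1}


def fetch_state(master_url):
    """Fetch and parse the master state (endpoint path is an assumption)."""
    with urlopen(master_url.rstrip("/") + "/master/state.json", timeout=10) as resp:
        return json.loads(resp.read().decode("utf-8"))


if __name__ == "__main__":
    # Usage: python check_frameworks.py http://<master-host>:5050
    for name, ids in duplicate_frameworks(fetch_state(sys.argv[1])).items():
        print("framework %r is registered %d times: %s" % (name, len(ids), ids))
```

Run periodically, something like this would at least flag the second
registration before Consul starts routing traffic to both namenodes.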
The Marathon web interface was no longer responding and everything timed
out. I found out that only a single Marathon process was running. To get
HDFS running again I killed the containers and killed the Marathon
process. From the logs I couldn't work out why this happened: in the 10
minutes around the registration of the framework there is nothing but
offers, HTTP calls and task syncs.
The strange thing I just noticed is that Marathon occasionally
re-registers itself even though its process has not been restarted or
re-elected.
Does anyone have an idea where to look?
--
Rogier Dikkes
Systeem Programmeur Hadoop & HPC Cloud
SURFsara | Science Park 140 | 1098 XG Amsterdam