I know we're supposed to run the mesos daemons under supervision (i.e.,
bring them back up automatically if they fail). But I'm interested in not
having the mesos-master fail at all, especially a failure in the registry /
replicated_log, which I am already a little scared of.
Situation:
- Mesos version: 0.20.1
- 39 mesos-slave hosts (on bare metal)
- originally had 30, added 9 (see below)
- 3 mesos-master hosts (on VMs)
- 5 zookeepers (on bare metal)
Problems during slave addition:
(1) Brought up 1 brand-new slave, which caused the acting master to die with
this error:
*"Failed to admit slave ... Failed to update 'registry': Failed to perform
store within 5secs"*
(2) 11 minutes later, brought up 8 more brand-new slaves, which caused the
new acting master to die with this error:
*"Failed to admit slave ... Failed to update 'registry': version mismatch"*
I'm even more afraid of the registry now. :( Is it likely that
there's something fundamentally wrong with my configuration and/or setup that
would make the registry so fragile? I was guessing that running
the mesos-master on VMs might be bad and lead to the initial error about the
store not completing within 5 seconds. But the latter problem is just
baffling to me. Everything *seems* ok right now. Maybe. Hopefully.
Thanks!
- Erik