Erik, there were significant improvements to the registry in Mesos 0.21.0. I'd recommend you try a more recent Mesos version, such as 0.22.1, which was released just this week.

I'd also recommend that you make sure the networking between your masters is relatively low-latency, because updates will fail if the active master cannot write to the other masters' registries within --registry_store_timeout. Alternatively, you can just bump up this timeout, and maybe --registry_fetch_timeout.

On Thu, May 7, 2015 at 6:15 PM, Erik Weathers <[email protected]> wrote:

> I know we're supposed to run the mesos daemons under supervision (i.e.,
> bring them back up automatically if they fail). But I'm interested in not
> having the mesos-master fail at all, especially a failure in the registry /
> replicated_log, which I am already a little scared of.
>
> Situation:
>
>    - Mesos version: 0.20.1
>    - 30 mesos-slave hosts (on bare metal)
>       - originally had 30, now have 39
>    - 3 mesos-master hosts (on VMs)
>    - 5 zookeepers (on bare metal)
>
> Problems during slave addition:
>
> (1) Brought up 1 brand new slave, this caused the acting master to die
> with this error:
>
> *"Failed to admit slave ... Failed to update 'registry': Failed to perform
> store within 5secs"*
>
>
> (2) 11 minutes later, brought up 8 more brand new slaves, this caused the
> new acting master to die with this error:
>
> *"Failed to admit slave ... Failed to update 'registry': version mismatch"*
>
>
> I'm now even more afraid of the registry now. :( Is it likely that
> there's some fundamental improperness in my configuration and/or setup that
> would lead to the registry being so fragile? I was guessing that running
> the mesos-master on VMs might be bad and lead to the initial error about the
> store not completing within 5 seconds. But the latter problem is just
> baffling to me. Everything *seems* ok right now. Maybe. Hopefully.
>
> Thanks!
>
> - Erik
>

