So just to be clear about your setup, were the masters being run under supervision? What was the restart policy?
For a small cluster, I'm surprised that 5 seconds was insufficient for performing a replicated log write. Do you ever see writes stall on your VMs? If you share the logs of each master during the write timeout, we would be able to see whether there was significant network or disk latency. The other possibility for the write timeout is that fewer than a quorum of masters were online when the write was performed; have you checked the logs for that?

The "version mismatch" should be very rare. It occurs when competing writes are performed (e.g. a new master is elected and writes, but the previous master has yet to demote itself and performs a write of its own). Have you looked through the logs to see the timeline across the masters? Please share them if you can :)

As for *never* having things "fail": I totally understand this concern, and it's something that has been considered. In this context, "fail" typically means a non-recoverable error that leads to the process exiting. There are a number of technical solutions to this general problem (e.g. supervisor trees <http://www.erlang.org/doc/design_principles/sup_princ.html>). However, since the possibility of crashes always remains (e.g. SIGSEGV, SIGABRT, etc.), to be safe users must run the components under supervision, at which point recovery is handled by an external solution anyway.

All of that being said, you shouldn't be seeing these particular failures, so let's try to diagnose further!

On Thu, May 7, 2015 at 6:15 PM, Erik Weathers <[email protected]> wrote:

> I know we're supposed to run the mesos daemons under supervision (i.e.,
> bring them back up automatically if they fail). But I'm interested in not
> having the mesos-master fail at all, especially a failure in the registry /
> replicated_log, which I am already a little scared of.
>
> Situation:
>
>    - Mesos version: 0.20.1
>    - 30 mesos-slave hosts (on bare metal)
>       - originally had 30, now have 39
>    - 3 mesos-master hosts (on VMs)
>    - 5 zookeepers (on bare metal)
>
> Problems during slave addition:
>
> (1) Brought up 1 brand new slave, this caused the acting master to die
> with this error:
>
> *"Failed to admit slave ... Failed to update 'registry': Failed to perform
> store within 5secs"*
>
> (2) 11 minutes later, brought up 8 more brand new slaves, this caused the
> new acting master to die with this error:
>
> *"Failed to admit slave ... Failed to update 'registry': version mismatch"*
>
> I'm now even more afraid of the registry. :( Is it likely that
> there's some fundamental improperness in my configuration and/or setup that
> would lead to the registry being so fragile? I was guessing that running
> the mesos-master on VMs might be bad and lead to the initial error about the
> store not completing within 5 seconds. But the latter problem is just
> baffling to me. Everything *seems* ok right now. Maybe. Hopefully.
>
> Thanks!
>
> - Erik
>
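
P.S. Since the supervision question came up: for anyone following along, here's a minimal sketch of running mesos-master under supervision with systemd. The ZooKeeper endpoints, paths, and unit name are illustrative (not taken from Erik's setup); the key parts are `Restart=always` so the master is brought back after any exit, and `--quorum=2` matching a 3-master ensemble.

```ini
# /etc/systemd/system/mesos-master.service (illustrative example)
[Unit]
Description=Apache Mesos Master
After=network.target

[Service]
# ZK endpoints and work_dir are placeholders; adjust for your cluster.
ExecStart=/usr/sbin/mesos-master \
    --zk=zk://zk1:2181,zk2:2181,zk3:2181/mesos \
    --quorum=2 \
    --work_dir=/var/lib/mesos
# Restart the master on any exit (crash or clean), after a short delay.
Restart=always
RestartSec=5

[Install]
WantedBy=multi-user.target
```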

