Re: preventing registry failures from happening in mesos-master?

2015-05-09 Thread Adam Bordelon
Erik, there were significant improvements to the registry in Mesos 0.21.0.
I'd recommend you try a more recent Mesos version, such as 0.22.1, which
was released just this week.

I'd also recommend that you make sure the networking between your masters
is relatively low-latency, because updates will fail if the active master
cannot write to the other masters' registries within
--registry_store_timeout. Alternatively, you can just bump up this timeout,
and maybe --registry_fetch_timeout.
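
As a rough sketch of what bumping those timeouts could look like, the
snippet below assembles a mesos-master command line. The timeout values
(20secs, 2mins), the ZooKeeper hostnames, and the work directory are
illustrative assumptions, not recommendations from this thread; the quorum
is computed as a simple majority of the masters.

```python
# Sketch only: build a mesos-master command line with longer registry timeouts.
# The timeout values, ZooKeeper URL, and work_dir are hypothetical placeholders.

def master_argv(num_masters, zk_url, work_dir,
                store_timeout="20secs", fetch_timeout="2mins"):
    """Return the argv list for one mesos-master process."""
    # The replicated log needs a majority of masters to acknowledge a write.
    quorum = num_masters // 2 + 1
    return [
        "mesos-master",
        "--zk=" + zk_url,
        "--work_dir=" + work_dir,
        "--quorum=%d" % quorum,
        # How long the active master waits for a registry write to complete.
        "--registry_store_timeout=" + store_timeout,
        # How long a master waits to fetch the registry during recovery.
        "--registry_fetch_timeout=" + fetch_timeout,
    ]


if __name__ == "__main__":
    argv = master_argv(3, "zk://zk1:2181,zk2:2181,zk3:2181/mesos",
                       "/var/lib/mesos")
    print(" ".join(argv))
```

Every master would need to be started with the same settings for the change
to hold after a failover.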

On Thu, May 7, 2015 at 6:15 PM, Erik Weathers eweath...@groupon.com wrote:

 I know we're supposed to run the mesos daemons under supervision (i.e.,
 bring them back up automatically if they fail).   But I'm interested in not
 having the mesos-master fail at all, especially a failure in the registry /
 replicated_log, which I am already a little scared of.

 Situation:

 - Mesos version: 0.20.1
 - 30 mesos-slave hosts (on bare metal)
    - originally had 30, now have 39
 - 3 mesos-master hosts (on VMs)
 - 5 zookeepers (on bare metal)

 Problems during slave addition:

 (1) Brought up 1 brand new slave, this caused the acting master to die
 with this error:

 *Failed to admit slave ... Failed to update 'registry': Failed to perform
 store within 5secs*


 (2) 11 minutes later, brought up 8 more brand new slaves, this caused the
 new acting master to die with this error:

 *Failed to admit slave ... Failed to update 'registry': version mismatch*


 I'm even more afraid of the registry now. :( Is it likely that
 there's some fundamental improperness in my configuration and/or setup that
 would lead to the registry being so fragile?   I was guessing that running
 the mesos-master on VMs might be bad and lead to the initial error about the
 store not completing within 5 seconds.  But the latter problem is just
 baffling to me.  Everything *seems* ok right now.  Maybe.  Hopefully.

 Thanks!

 - Erik



preventing registry failures from happening in mesos-master?

2015-05-07 Thread Erik Weathers
I know we're supposed to run the mesos daemons under supervision (i.e.,
bring them back up automatically if they fail).   But I'm interested in not
having the mesos-master fail at all, especially a failure in the registry /
replicated_log, which I am already a little scared of.

Situation:

   - Mesos version: 0.20.1
   - 30 mesos-slave hosts (on bare metal)
      - originally had 30, now have 39
   - 3 mesos-master hosts (on VMs)
   - 5 zookeepers (on bare metal)

Problems during slave addition:

(1) Brought up 1 brand new slave, this caused the acting master to die with
this error:

*Failed to admit slave ... Failed to update 'registry': Failed to perform
store within 5secs*


(2) 11 minutes later, brought up 8 more brand new slaves, this caused the
new acting master to die with this error:

*Failed to admit slave ... Failed to update 'registry': version mismatch*


I'm even more afraid of the registry now. :( Is it likely that
there's some fundamental improperness in my configuration and/or setup that
would lead to the registry being so fragile?   I was guessing that running
the mesos-master on VMs might be bad and lead to the initial error about the
store not completing within 5 seconds.  But the latter problem is just
baffling to me.  Everything *seems* ok right now.  Maybe.  Hopefully.

Thanks!

- Erik