Thanks, everybody, for all your insights! I totally agree with Tom's last response. The per-node services definitely belong at the level that provisions the machine and the mesos-slave service itself (in our case, pre-configured GCE images).
So I guess the problem I wanted to solve is more general: how can I make sure there are resources reserved for all of the system-level stuff that is running outside of the mesos context?

To be more specific, if I have a machine with 16 CPUs, it is common that my framework will schedule 16 heavy number-crunching processes on it. This can starve anything else that's running on the machine, like the logging aggregation service and the mesos-slave service itself (this probably explains the lost tasks we've been observing).

What's the best-practice solution for this situation?

On Wed, Jan 7, 2015 at 2:09 AM, Tom Arnfeld <[email protected]> wrote:

> I completely agree with Charles, though I think I can appreciate what
> you're trying to do here. Take the log aggregation service as an example,
> you want that on every slave to aggregate logs, but want to avoid using yet
> another layer of configuration management to deploy it.
>
> I'm of the opinion that these kinds of auxiliary services which all work
> together (the mesos-slave process included) to define what we mean by a
> "slave" are the responsibility of whoever/whatever is provisioning the
> mesos-slave process and possibly even the machine itself. In our case,
> that's Chef. IMO once a slave registers with the mesos cluster it's
> immediately ready to start doing work, and mesos will actually start
> offering that slave immediately.
>
> If you continue down this path you're also going to run into a variety of
> interesting timing issues when these services fail, or when you want to
> upgrade them. I'd suggest taking a look at some kind of more advanced
> process monitor to run these aux services like M/Monit instead of mesos
> (via Marathon).
>
> Think of it another way, would you want something running through mesos to
> install apt package updates once a day? That'd be super weird, so why would
> log aggregation be any different?
>
> --
>
> Tom Arnfeld
> Developer // DueDil
>
>
> On Tue, Jan 6, 2015 at 11:57 PM, Charles Baker <[email protected]> wrote:
>
>> It seems like an 'anti-pattern' (for lack of a better term) to attempt to
>> force locality on a bunch of dependency services launched through Marathon.
>> I thought the whole idea of Mesos (and Marathon) was to treat the data
>> center as one giant computer in which it fundamentally should not matter
>> where your services are launched. Although I obviously don't know the
>> details of the use-case and may be grossly misunderstanding what you are
>> trying to do, to me it sounds like you are attempting to shoehorn a
>> non-distributed application into a distributed architecture. If this is the
>> case, you may want to revisit your implementation and try to decouple the
>> application's requirement of node-level dependency locality. It is also a
>> good opportunity to possibly redesign a monolithic application into a
>> distributed one.
>>
>> On Tue, Jan 6, 2015 at 12:53 PM, David Greenberg <[email protected]>
>> wrote:
>>
>>> Tom is absolutely correct--you also need to ensure that your "special
>>> tasks" run as a user which is assigned a role w/ a special reservation to
>>> ensure they can always launch.
>>>
>>> On Tue, Jan 6, 2015 at 2:38 PM, Tom Arnfeld <[email protected]> wrote:
>>>
>>>> I'm not sure if I'm fully aware of the use case but if you use a
>>>> different framework (aka Marathon) to launch these services, should the
>>>> service die and need to be re-launched (or even the slave restarts) could
>>>> you not be in a position where another framework has consumed all resources
>>>> on that slave and your "core" tasks cannot launch?
>>>>
>>>> Maybe if you're just using Marathon it might provide a sort of priority
>>>> to decide who gets what resources first, but with multiple frameworks you
>>>> might need to look into the slave resource reservations and framework
>>>> roles.
>>>>
>>>> FWIW We're configuring these things out of band (via Chef to be
>>>> specific).
>>>>
>>>> Hope this helps!
>>>>
>>>> --
>>>>
>>>> Tom Arnfeld
>>>> Developer // DueDil
>>>>
>>>> (+44) 7525940046
>>>> 25 Christopher Street, London, EC2A 2BS
>>>>
>>>>
>>>> On Tue, Jan 6, 2015 at 9:05 AM, Itamar Ostricher <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was wondering if the best approach to do what I want is to use mesos
>>>>> itself, or other Linux system tools.
>>>>>
>>>>> There are a bunch of services that our framework assumes are running
>>>>> on all participating slaves (e.g. logging service, data-bridge service,
>>>>> etc.).
>>>>> One approach to do that is at the infrastructure level, making sure
>>>>> that slave nodes are configured correctly (e.g. with pre-configured
>>>>> images, or other provisioning systems).
>>>>> Another approach would be to use mesos itself (maybe with something
>>>>> like Marathon) to schedule these services on all slave nodes.
>>>>>
>>>>> The advantage of the mesos-based approach is that it becomes trivial
>>>>> to account for the resource consumption of said services (e.g. make sure
>>>>> there's always at least 1 CPU dedicated to this).
>>>>> I'm not sure how to achieve something similar with the system-approach.
>>>>>
>>>>> Does anyone have any insights on this?
>>>>>
>>>>
>>>>
>>>
>>
>
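
P.S. To make the 16-CPU example concrete: one option that might work is to have the slave advertise less than the machine actually has, through the mesos-slave `--resources` flag, so the remainder is never offered to frameworks. Below is a rough, untested sketch of how the provisioning step could compute that value. The function name, the 1-CPU / 1024 MB headroom defaults, and the 64 GB memory figure are all assumptions for illustration, not numbers from our setup.

```python
def advertised_resources(total_cpus, total_mem_mb,
                         reserved_cpus=1.0, reserved_mem_mb=1024):
    """Build a value for the mesos-slave --resources flag.

    The headroom (reserved_cpus, reserved_mem_mb) is withheld from Mesos so
    that system-level services outside the mesos context keep some capacity.
    The default headroom values are hypothetical.
    """
    cpus = total_cpus - reserved_cpus
    mem_mb = int(total_mem_mb - reserved_mem_mb)
    if cpus <= 0 or mem_mb <= 0:
        raise ValueError("reservation leaves nothing to offer to Mesos")
    # {:g} drops a trailing ".0", so 15.0 is rendered as "15".
    return "cpus:{:g};mem:{:d}".format(cpus, mem_mb)


if __name__ == "__main__":
    # Hypothetical 16-CPU machine with 64 GB of RAM.
    print("--resources=" + advertised_resources(16, 65536))
```

This only limits what Mesos offers; it does not by itself cap what running tasks can consume, so whether it is enough probably depends on the isolation configured on the slave.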

