Thanks everybody for all your insights!

I totally agree with the last response from Tom.
The per-node services definitely belong to the level that provisions the
machine and the mesos-slave service itself (in our case, pre-configured GCE
images).

So I guess the problem I wanted to solve is more general - how can I make
sure there are resources reserved for all of the system-level stuff that
is running outside of the mesos context?
To be more specific, if I have a machine with 16 CPUs, it is common that my
framework will schedule 16 heavy number-crunching processes on it.
This can starve anything else running on the machine, such as the log
aggregation service and the mesos-slave service itself (which probably
explains the lost tasks we've been observing).
What's the best-practice solution for this situation?
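
One possible way to do this, as a rough sketch rather than an established
best practice: advertise less than the full machine to Mesos through the
mesos-slave --resources flag, so the difference is left for everything
running outside of mesos. The helper name, the 1 CPU / 1024 MB headroom and
the 64 GB machine below are assumptions made up for illustration.

```python
def advertised_resources(total_cpus, total_mem_mb,
                         reserved_cpus=1, reserved_mem_mb=1024):
    """Build a value for the mesos-slave --resources flag.

    The reserved_* amounts (hypothetical defaults) are withheld from Mesos
    so that system-level services keep some headroom.
    """
    cpus = total_cpus - reserved_cpus
    mem_mb = total_mem_mb - reserved_mem_mb
    if cpus <= 0 or mem_mb <= 0:
        raise ValueError("reservation leaves nothing to offer to Mesos")
    # Mesos resource strings are semicolon-separated name:value pairs.
    return "cpus:{};mem:{}".format(cpus, mem_mb)


if __name__ == "__main__":
    # A 16-CPU machine with an assumed 64 GB of RAM.
    print("--resources=" + advertised_resources(16, 65536))
    # prints: --resources=cpus:15;mem:64512
```

Note that this only limits what Mesos offers to frameworks; whether running
tasks are actually held to their allocation depends on the isolation
configured on the slave.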

On Wed, Jan 7, 2015 at 2:09 AM, Tom Arnfeld <[email protected]> wrote:

> I completely agree with Charles, though I think I can appreciate what
> you're trying to do here. Take the log aggregation service as an example,
> you want that on every slave to aggregate logs, but want to avoid using yet
> another layer of configuration management to deploy it.
>
> I'm of the opinion that these kinds of auxiliary services which all work
> together (the mesos-slave process included) to define what we mean by a
> "slave" are the responsibility of whoever/whatever is provisioning the
> mesos-slave process and possibly even the machine itself. In our case,
> that's Chef. IMO once a slave registers with the mesos cluster it's
> immediately ready to start doing work, and mesos will actually start
> offering that slave immediately.
>
> If you continue down this path you're also going to run into a variety of
> interesting timing issues when these services fail, or when you want to
> upgrade them. I'd suggest taking a look at some kind of more advanced
> process monitor to run these aux services like M/Monit instead of mesos
> (via Marathon).
>
> Think of it another way, would you want something running through mesos to
> install apt package updates once a day? That'd be super weird, so why would
> log aggregation be any different?
>
> --
>
> Tom Arnfeld
> Developer // DueDil
>
>
> On Tue, Jan 6, 2015 at 11:57 PM, Charles Baker <[email protected]> wrote:
>
>> It seems like an 'anti-pattern' (for lack of a better term) to attempt to
>> force locality on a bunch of dependency services launched through Marathon.
>> I thought the whole idea of Mesos (and Marathon) was to treat the data
>> center as one giant computer in which it fundamentally should not matter
>> where your services are launched. I obviously don't know the details of
>> the use-case and may be grossly misunderstanding what you are trying to
>> do, but to me it sounds like you are attempting to shoehorn a
>> non-distributed application into a distributed architecture. If this is the
>> case, you may want to revisit your implementation and try to decouple the
>> application's requirement of node-level dependency locality. It is also a
>> good opportunity to possibly redesign a monolithic application into a
>> distributed one.
>>
>> On Tue, Jan 6, 2015 at 12:53 PM, David Greenberg <[email protected]>
>> wrote:
>>
>>> Tom is absolutely correct--you also need to ensure that your "special
>>> tasks" run as a user which is assigned a role w/ a special reservation to
>>> ensure they can always launch.
>>>
>>> On Tue, Jan 6, 2015 at 2:38 PM, Tom Arnfeld <[email protected]> wrote:
>>>
>>>> I'm not sure if I'm fully aware of the use case but if you use a
>>>> different framework (aka Marathon) to launch these services, should the
>>>> service die and need to be re-launched (or even the slave restarts) could
>>>> you not be in a position where another framework has consumed all resources
>>>> on that slave and your "core" tasks cannot launch?
>>>>
>>>> Maybe if you're just using Marathon it might provide a sort of priority
>>>> to decide who gets what resources first, but with multiple frameworks you
>>>> might need to look into the slave resource reservations and framework roles.
>>>>
>>>> FWIW We're configuring these things out of band (via Chef to be
>>>> specific).
>>>>
>>>> Hope this helps!
>>>>
>>>> --
>>>>
>>>> Tom Arnfeld
>>>> Developer // DueDil
>>>>
>>>> (+44) 7525940046
>>>> 25 Christopher Street, London, EC2A 2BS
>>>>
>>>>
>>>> On Tue, Jan 6, 2015 at 9:05 AM, Itamar Ostricher <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I was wondering if the best approach to do what I want is to use mesos
>>>>> itself, or other Linux system tools.
>>>>>
>>>>> There are a bunch of services that our framework assumes are running
>>>>> on all participating slaves (e.g. logging service, data-bridge service,
>>>>> etc.).
>>>>> One approach is to do that at the infrastructure level, making sure
>>>>> that slave nodes are configured correctly (e.g. with pre-configured
>>>>> images, or other provisioning systems).
>>>>> Another approach would be to use mesos itself (maybe with something
>>>>> like Marathon) to schedule these services on all slave nodes.
>>>>>
>>>>> The advantage of the mesos-based approach is that it becomes trivial
>>>>> to account for the resource consumption of said services (e.g. make sure
>>>>> there's always at least 1 CPU dedicated to this).
>>>>> I'm not sure how to achieve something similar with the system-approach.
>>>>>
>>>>> Does anyone have any insights on this?
>>>>>
>>>>
>>>>
>>>
>>
>
