Ben pretty accurately described how Aurora fills some of these duties, but Dan is right — we're still on the cusp of being *really* open-sourced, so it's not very usable yet. Once our incubator vote is over, I hope to promptly change this so outside users and contributors can dive in.
-=Bill

On Fri, Sep 27, 2013 at 5:21 PM, Benjamin Mahler <[email protected]> wrote:

> I've replied inline below, and also cc'ed some of the Aurora / Thermos
> developers to better answer your questions.
>
> On Fri, Sep 27, 2013 at 9:00 AM, Dan Colish <[email protected]> wrote:
>
>> I have been working on an internal project for executing a large number
>> of jobs across a cluster for the past couple of months, and I am currently
>> doing a spike on using Mesos for some of the cluster management tasks. The
>> clear prior-art winners are Aurora and Marathon, but in both cases they
>> fall short of what I need.
>>
>> In Aurora's case, the software is clearly very early in the open-sourcing
>> process and as a result is missing significant pieces. The biggest missing
>> piece is the actual execution framework, Thermos. [That is what I assume
>> Thermos does. I have no internal knowledge to verify that assumption.]
>> Additionally, Aurora is heavily optimized for a high user count and a large
>> number of incoming jobs. My use case is much simpler: there is only one
>> effective user, and we have a small, known set of jobs which need to run.
>>
>> On the other hand, Marathon is not designed for job execution if a job is
>> defined to be a smaller unit of work. Instead, Marathon describes itself as a
>> meta-framework for deploying frameworks to a Mesos cluster. A job to
>> Marathon is the framework that runs. I do not think Marathon would be a
>> good fit for managing my task execution and retry logic. It is designed
>> to run as a sub-layer of the cluster's resource allocation scheduler,
>> and its abstractions follow suit.
>>
>> For my needs Aurora does appear to be a much closer fit than Marathon,
>> but neither is ideal. Since that is the case, I find myself left with a
>> rough choice.
>> I am not thrilled with the prospect of yet another framework
>> for Mesos, but there is a lot of work which I have already completed for my
>> internal project that would need to be reworked to fit with Aurora. Currently
>> my project supports the following features:
>>
>> * Distributed job locking - jobs cannot overlap
>
> Can you elaborate on the ways in which jobs cannot overlap? Aurora may
> provide what you need here.
>
>> * Job execution delay queue - jobs can be run immediately or after a delay
>> * Job preemption
>
> Aurora has the concept of preemption when there are insufficient
> resources for "production" jobs to run. I will defer to the Aurora devs to
> elaborate here.
>
>> * Job success/failure tracking
>
> What kind of tracking?
>
>> * Garbage collection of dead jobs
>
> This is present in Aurora. Eventually, a completed job will be purged from
> the entire system. What kind of garbage collection are you referring to?
>
>> * Job execution failover - a job is retried on a new executor
>
> In Aurora, jobs are restarted if they fail.
>
>> * Executor warming - a minimum number of executors kept idle
>> * Executor limits - a maximum number of executors available
>>
>> My plan for integration with Mesos is to adapt my job manager into a
>> Mesos scheduler and my execution slaves into a Mesos executor. At that
>> point, my framework will be able to run on the Mesos cluster, but I have a
>> few concerns about how to allocate and release the resources that the
>> executors will use over the lifetime of the cluster. I am not sure whether
>> it is better to be greedy early in the framework's life-cycle, or to
>> decline resources initially and scale the framework's slaves up when jobs
>> start coming in.
>
> The better design would be to use resources as you need them (rather than
> greedily holding onto offers). Is there any motivation for the greedy
> approach?
>
>> Additionally, the relationship between the executor and its associated
>> driver is not immediately clear to me. If I am reading the code correctly,
>> they do not provide a way to stop a task in progress short of killing the
>> executor process.
>
> You can kill tasks in progress; the executor process receives the request
> to kill the task. See SchedulerDriver::killTask and Executor::killTask
> (scheduler.hpp and executor.hpp).
>
>> I think that Mesos will be a nice feature to add to my project, and I
>> would really appreciate any feedback from the community. I will provide
>> progress updates as I continue work on my experiments.

