Ben pretty accurately described how Aurora fills some of these duties, but Dan is right — we're still on the cusp of being *really* open-sourced, so it's not very usable yet. Once our incubator vote is over, I hope to promptly change this so outside users and contributors can dive in.
-=Bill

On Fri, Sep 27, 2013 at 5:21 PM, Benjamin Mahler <[email protected]> wrote:

> I've replied inline below, and also cc'ed some of the Aurora / Thermos
> developers to better answer your questions.
>
> On Fri, Sep 27, 2013 at 9:00 AM, Dan Colish <[email protected]> wrote:
>
>> I have been working on an internal project for executing a large number
>> of jobs across a cluster for the past couple of months, and I am currently
>> doing a spike on using Mesos for some of the cluster management tasks. The
>> clear prior-art winners are Aurora and Marathon, but in both cases they
>> fall short of what I need.
>>
>> In Aurora's case, the software is clearly very early in the open-sourcing
>> process and as a result is missing significant pieces. The biggest missing
>> piece is the actual execution framework, Thermos. [That is what I assume
>> Thermos does. I have no internal knowledge to verify that assumption.]
>> Additionally, Aurora is heavily optimized for a high user count and a large
>> number of incoming jobs. My use case is much simpler: there is only one
>> effective user, and we have a small, known set of jobs which need to run.
>>
>> On the other hand, Marathon is not designed for job execution if a job is
>> defined to be a smaller unit of work. Instead, Marathon describes itself as a
>> meta-framework for deploying frameworks to a Mesos cluster. A job to
>> Marathon is the framework that runs. I do not think Marathon would be a
>> good fit for managing my task execution and retry logic. It is designed
>> to run as a sub-layer of the cluster's resource allocation scheduler,
>> and its abstractions follow suit.
>>
>> For my needs Aurora does appear to be a much closer fit than Marathon,
>> but neither is ideal. Since that is the case, I find myself left with a
>> rough choice.
>> I am not thrilled with the prospect of yet another framework
>> for Mesos, but there is a lot of work which I have already completed for my
>> internal project that would need to be reworked to fit with Aurora. Currently
>> my project supports the following features:
>>
>> * Distributed job locking - jobs cannot overlap
>
> Can you elaborate on the ways in which jobs cannot overlap? Aurora may
> provide what you need here.
>
>> * Job execution delay queue - jobs can be run immediately or after a delay
>> * Job preemption
>
> Aurora has the concept of preemption when there are insufficient
> resources for "production" jobs to run. I will defer to the Aurora devs to
> elaborate here.
>
>> * Job success/failure tracking
>
> What kind of tracking?
>
>> * Garbage collection of dead jobs
>
> This is present in Aurora. Eventually, a completed job will be purged from
> the entire system. What kind of garbage collection are you referring to?
>
>> * Job execution failover - a job is retried on a new executor
>
> In Aurora, jobs are restarted if they fail.
>
>> * Executor warming - a minimum number of executors kept idle
>> * Executor limits - a maximum number of executors available
>>
>> My plan for integration with Mesos is to adapt my job manager into a
>> Mesos scheduler and my execution slaves into a Mesos executor. At that
>> point, my framework will be able to run on the Mesos cluster, but I have a
>> few concerns about how to allocate and release the resources that the
>> executors will use over the lifetime of the cluster. I am not sure whether
>> it is better to be greedy early in the framework's life-cycle, or to
>> decline resources initially and scale the framework's slaves up when jobs
>> start coming in.
>
> The better design would be to use resources as you need them (rather than
> greedily holding onto offers). Is there any motivation for the greedy
> approach?
>
>> Additionally, the relationship between the executor and its associated
>> driver is not immediately clear to me. If I am reading the code correctly,
>> they do not provide a way to stop a task in progress short of killing the
>> executor process.
>
> You can kill tasks in progress; the executor process receives the request
> to kill the task. See SchedulerDriver::killTask and Executor::killTask
> (scheduler.hpp and executor.hpp).
>
>> I think that Mesos will be a nice feature to add to my project, and I
>> would really appreciate any feedback from the community. I will provide
>> progress updates as I continue work on my experiments.

