Hi @Ankur, it might be a bit late in the conversation to respond, but we had almost exactly the same questions you did, so I thought I'd share our story.
We originally needed to build a data science infrastructure and chose to do it on top of Mesos. We first spent a few months trying to run all of our applications on Spark. We quickly ran into all sorts of issues and challenges (from chasing down confusing bugs to getting creative with manual partitioning in Spark), and then realized that many of these problems were more easily solved with simple scripts. After a couple of months, we mostly moved off of Spark (though we're certainly excited to put future applications on Spark; it's an awesome project and has a lot of promise!).

The next thing we tried was running thousands of instances of our simple scripts on Marathon, with the idea of building some sort of very simple autoscaler to manage them. We found that Marathon could not handle the turnover of so many small tasks: it became unresponsive, got confused about task state, and (I think) occasionally lost track of some tasks. On the other hand, it works quite well for long-running applications, so we still use it for those.

Next, we built Relay <https://github.com/sailthru/relay> and its Mesos extension, Relay.Mesos <https://github.com/sailthru/relay.mesos>, to convert our small scripts into long-running instances we could then put on Marathon. Relay.Mesos autoscales thousands of instances of small scripts on Mesos, and we can run multiple Relay instances for better fault tolerance. We're very happy with this simple little tool! It doesn't try to solve a bin-packing (or task-clustering) problem; it solves only the autoscaling problem (see the sketch below).
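The core idea is just a thermostat loop. Roughly something like the following; this is a sketch only, not Relay's actual code, and the `TaskPool` class and simulated `metric()` are hypothetical stand-ins for your own launch mechanism and monitoring:

```python
import random
import time

class TaskPool:
    """Hypothetical stand-in for whatever actually launches and kills
    instances of a small script (Relay.Mesos does this by launching
    Mesos tasks; this sketch just tracks a count)."""

    def __init__(self):
        self.running = 0

    def launch(self, n):
        self.running += n
        print(f"scale up by {n} -> {self.running} running")

    def kill(self, n):
        n = min(n, self.running)
        self.running -= n
        print(f"scale down by {n} -> {self.running} running")

def metric():
    """Hypothetical metric, e.g. queue length. Simulated here; in real
    use it would fall as running instances drain the queue."""
    return random.randint(0, 20)

def thermostat(pool, target=10, poll_seconds=5):
    # Classic feedback loop: when the metric is above target, add
    # instances; when it drops below, remove them.
    while True:
        error = metric() - target
        if error > 0:
            pool.launch(error)   # demand exceeds target: warm up
        elif error < 0:
            pool.kill(-error)    # below target: cool down
        time.sleep(poll_seconds)

if __name__ == "__main__":
    thermostat(TaskPool())
```

Running several of these loops side by side is what gives us the fault tolerance mentioned above: any one of them can keep the instance count near the target.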
Finally, to autoscale the Mesos nodes themselves, we use EC2 Auto Scaling policies <http://aws.amazon.com/autoscaling/> based on aggregate metrics (like CPU usage) reported from our Mesos slaves. The scaling policies simply spin up (or terminate) Mesos slave instances, which then auto-register with the cluster and are ready to run tasks. We currently track just CPU usage, and that works well enough that we haven't needed to invest more time in it. A rough example of such a policy is sketched below.
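For a concrete idea of what those policies look like, here is a hedged sketch using boto3. The group name, thresholds, and cooldowns are made-up placeholders, not our production values; a mirror-image policy and alarm would handle scaling down:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Hypothetical Auto Scaling group whose launch config boots a Mesos
# slave that registers itself with the master on startup.
ASG = "mesos-slaves"

# Simple step policy: add one instance when triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG,
    PolicyName="mesos-slaves-scale-up",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Trigger the policy when average CPU across the group stays above
# 70% for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="mesos-slaves-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```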
The key strategy we've adopted through this process is to make each tool we use solve one very specific, well-understood problem. I have found that trying to use "one tool to rule them all" is not practical, because we have a lot of different kinds of problems to solve. On the other hand, we now have a zillion different tools. At first the tools are overwhelming to new employees, but our group seems quite comfortable with them all, and I believe this has also created a culture where it's encouraged, and exciting, to try new things all the time.

I hope that was helpful, or if off the mark, at least interesting.

Alex
On Fri, Jun 5, 2015 at 4:37 AM Tim Chen <[email protected]> wrote:

> Hi Sharma,
>
> What metrics do you watch for demand and supply for Spark? Do you just
> watch node resources, or do you actually look at some Spark JMX stats?
>
> Tim
>
> On Thu, Jun 4, 2015 at 10:35 PM, Sharma Podila <[email protected]> wrote:
>
>> We autoscale our Mesos cluster in EC2 from within our framework. Scaling
>> up can be easy via watching demand vs. supply. However, scaling down
>> requires bin packing the tasks tightly onto as few servers as possible.
>> Do you have any specific ideas on how you would leverage Mantis/Mesos for
>> Spark-based jobs? Fenzo, the scheduler part of Mantis, could be another
>> point of leverage, which could give a framework the ability to autoscale
>> the cluster, among other benefits.
>>
>> On Thu, Jun 4, 2015 at 1:06 PM, Dmitry Goldenberg <[email protected]> wrote:
>>
>>> Thanks, Vinod. I'm really interested in how we could leverage something
>>> like Mantis and Mesos to achieve autoscaling in a Spark-based data
>>> processing system...
>>>
>>> On Jun 4, 2015, at 3:54 PM, Vinod Kone <[email protected]> wrote:
>>>
>>> Hey Dmitry. At the current time there is no built-in support for Mesos
>>> to autoscale nodes in the cluster. I've heard people (Netflix?) do it
>>> out of band on EC2.
>>>
>>> On Thu, Jun 4, 2015 at 9:08 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>
>>>> A Mesos noob here. Could someone point me at the doc or summary for the
>>>> cluster autoscaling capabilities in Mesos?
>>>>
>>>> Is there a way to feed it events and have it detect the need to bring
>>>> in more machines or decommission machines? Is there a way to receive
>>>> events back that notify you that machines have been allocated or
>>>> decommissioned?
>>>>
>>>> Would this work within a certain set of "preallocated"/pre-provisioned/
>>>> "stand-by" machines, or will Mesos go and grab machines from the cloud?
>>>>
>>>> What are the integration points of Apache Spark and Mesos? What are the
>>>> true advantages of running Spark on Mesos?
>>>>
>>>> Can Mesos autoscale the cluster based on some signals/events coming out
>>>> of the Spark runtime or Spark consumers, then cause the consumers to run
>>>> on the updated cluster, or signal to the consumers to restart themselves
>>>> into an updated cluster?
>>>>
>>>> Thanks.