Hi @Ankur, it might be a bit late in the conversation to respond, but we had almost exactly the same questions you did, so I thought I'd share our story.
We originally needed to build a data science infrastructure and chose to do it on top of Mesos. We first spent a few months trying to run all of our applications on Spark. We quickly ran into all sorts of issues and challenges (from chasing down confusing bugs to getting creative with manual partitioning in Spark), and then realized that many of these problems were more easily solved with simple scripts. After a couple of months, we mostly moved off of Spark (though we're certainly excited to put future applications on Spark; it's an awesome project and has a lot of promise!).

The next thing we tried was running thousands of instances of our simple scripts on Marathon, with the idea of building some sort of very simple autoscaler to manage them. We found that Marathon could not handle the turnover of so many small tasks: it became unresponsive, got confused about task state, and (I think) occasionally lost track of some tasks. On the other hand, it works quite well for long-running applications, so we still use it for those.

Next, we built Relay <https://github.com/sailthru/relay> and its Mesos extension, Relay.Mesos <https://github.com/sailthru/relay.mesos>, to convert our small scripts into long-running instances we could then put on Marathon. Relay.Mesos autoscales thousands of instances of small scripts on Mesos, and we can run multiple Relay instances for better fault tolerance. We're very happy with this simple little tool! It doesn't try to solve a bin-packing (or task-clustering) problem; it solves only the autoscaling problem (see the sketch below).
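The core idea is just a thermostat loop. Roughly something like the following; this is a sketch only, not Relay's actual code, and the `TaskPool` class and simulated `metric()` are hypothetical stand-ins for your own launch mechanism and monitoring:

```python
import random
import time

class TaskPool:
    """Hypothetical stand-in for whatever actually launches and kills
    instances of a small script (Relay.Mesos does this by launching
    Mesos tasks; this sketch just tracks a count)."""

    def __init__(self):
        self.running = 0

    def launch(self, n):
        self.running += n
        print(f"scale up by {n} -> {self.running} running")

    def kill(self, n):
        n = min(n, self.running)
        self.running -= n
        print(f"scale down by {n} -> {self.running} running")

def metric():
    """Hypothetical metric, e.g. queue length. Simulated here; in real
    use it would fall as running instances drain the queue."""
    return random.randint(0, 20)

def thermostat(pool, target=10, poll_seconds=5):
    # Classic feedback loop: when the metric is above target, add
    # instances; when it drops below, remove them.
    while True:
        error = metric() - target
        if error > 0:
            pool.launch(error)   # demand exceeds target: warm up
        elif error < 0:
            pool.kill(-error)    # below target: cool down
        time.sleep(poll_seconds)

if __name__ == "__main__":
    thermostat(TaskPool())
```

Running several of these loops side by side is what gives us the fault tolerance mentioned above: any one of them can keep the instance count near the target.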
Finally, to autoscale the Mesos nodes themselves, we use EC2 Auto Scaling policies <http://aws.amazon.com/autoscaling/> based on aggregate metrics (like CPU usage) reported from our Mesos slaves. The scaling policies simply spin up (or terminate) Mesos slave instances, which then auto-register with the cluster and are ready to run tasks. We currently track just CPU usage, and that works well enough that we haven't needed to invest more time in it. A rough example of such a policy is sketched below.
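For a concrete idea of what those policies look like, here is a hedged sketch using boto3. The group name, thresholds, and cooldowns are made-up placeholders, not our production values; a mirror-image policy and alarm would handle scaling down:

```python
import boto3

autoscaling = boto3.client("autoscaling")
cloudwatch = boto3.client("cloudwatch")

# Hypothetical Auto Scaling group whose launch config boots a Mesos
# slave that registers itself with the master on startup.
ASG = "mesos-slaves"

# Simple step policy: add one instance when triggered.
policy = autoscaling.put_scaling_policy(
    AutoScalingGroupName=ASG,
    PolicyName="mesos-slaves-scale-up",
    AdjustmentType="ChangeInCapacity",
    ScalingAdjustment=1,
    Cooldown=300,
)

# Trigger the policy when average CPU across the group stays above
# 70% for two consecutive 5-minute periods.
cloudwatch.put_metric_alarm(
    AlarmName="mesos-slaves-cpu-high",
    Namespace="AWS/EC2",
    MetricName="CPUUtilization",
    Dimensions=[{"Name": "AutoScalingGroupName", "Value": ASG}],
    Statistic="Average",
    Period=300,
    EvaluationPeriods=2,
    Threshold=70.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```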
The key strategy we've adopted through this process is to make each tool we use solve one very specific, well-understood problem. I have found that trying to use "one tool to rule them all" is not practical, because we have a lot of different kinds of problems to solve. On the other hand, we now have a zillion different tools. At first the tools are overwhelming to new employees, but our group seems quite comfortable with them all, and I believe this has also created a culture where it's encouraged, and exciting, to try new things all the time.

I hope that was helpful, or if off the mark, at least interesting.

Alex
On Fri, Jun 5, 2015 at 4:37 AM Tim Chen <[email protected]> wrote:

> Hi Sharma,
>
> What metrics do you watch for demand and supply for Spark? Do you just
> watch node resources, or do you actually look at some Spark JMX stats?
>
> Tim
>
> On Thu, Jun 4, 2015 at 10:35 PM, Sharma Podila <[email protected]> wrote:
>
>> We autoscale our Mesos cluster in EC2 from within our framework. Scaling
>> up can be easy via watching demand vs. supply. However, scaling down
>> requires bin packing the tasks tightly onto as few servers as possible.
>> Do you have any specific ideas on how you would leverage Mantis/Mesos for
>> Spark-based jobs? Fenzo, the scheduler part of Mantis, could be another
>> point of leverage, which could give a framework the ability to autoscale
>> the cluster, among other benefits.
>>
>> On Thu, Jun 4, 2015 at 1:06 PM, Dmitry Goldenberg <[email protected]> wrote:
>>
>>> Thanks, Vinod. I'm really interested in how we could leverage something
>>> like Mantis and Mesos to achieve autoscaling in a Spark-based data
>>> processing system...
>>>
>>> On Jun 4, 2015, at 3:54 PM, Vinod Kone <[email protected]> wrote:
>>>
>>> Hey Dmitry. At the current time there is no built-in support for Mesos
>>> to autoscale nodes in the cluster. I've heard people (Netflix?) do it
>>> out of band on EC2.
>>>
>>> On Thu, Jun 4, 2015 at 9:08 AM, Dmitry Goldenberg <[email protected]> wrote:
>>>
>>>> A Mesos noob here. Could someone point me at the doc or summary for the
>>>> cluster autoscaling capabilities in Mesos?
>>>>
>>>> Is there a way to feed it events and have it detect the need to bring
>>>> in more machines or decommission machines? Is there a way to receive
>>>> events back that notify you that machines have been allocated or
>>>> decommissioned?
>>>>
>>>> Would this work within a certain set of "preallocated"/pre-provisioned/
>>>> "stand-by" machines, or will Mesos go and grab machines from the cloud?
>>>>
>>>> What are the integration points of Apache Spark and Mesos? What are the
>>>> true advantages of running Spark on Mesos?
>>>>
>>>> Can Mesos autoscale the cluster based on some signals/events coming out
>>>> of the Spark runtime or Spark consumers, then cause the consumers to run
>>>> on the updated cluster, or signal to the consumers to restart themselves
>>>> into an updated cluster?
>>>>
>>>> Thanks.