On Fri, May 30, 2014 at 4:40 PM, Jason Giedymin <[email protected]>
wrote:

> You would be surprised how far just scaling when resource offers are
> 'tight' and keeping track of idle CPU for each slave to shut them down can
> take you.
>

+1, it's really easy to get cpu/mem/disk usage from state.json and set
thresholds. All you'd need to do is consume that API, no changes to Mesos
required. Would love to see how well that works.
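
A minimal sketch of that idea, assuming the parsed state.json has a "slaves"
list (each with "id" and "resources") and a "frameworks" list whose "tasks"
carry "slave_id", "state" and "resources". Those field names are from memory
and may differ between Mesos versions, and the function names and threshold
values are made up for illustration:

```python
from collections import defaultdict


def cpu_by_slave(state):
    """Return (used, total) CPU counts keyed by slave id.

    Field names are assumptions about the state.json layout.
    """
    used = defaultdict(float)
    for framework in state.get("frameworks", []):
        for task in framework.get("tasks", []):
            if task.get("state") == "TASK_RUNNING":
                cpus = task.get("resources", {}).get("cpus", 0.0)
                used[task["slave_id"]] += cpus
    total = {s["id"]: s.get("resources", {}).get("cpus", 0.0)
             for s in state.get("slaves", [])}
    return used, total


def scaling_decision(state, scale_up_at=0.8, idle_below=0.05):
    """Return (action, idle_slave_ids) from hypothetical CPU thresholds."""
    used, total = cpu_by_slave(state)
    cluster_total = sum(total.values())
    cluster_used = sum(used[sid] for sid in total)
    # Offers are 'tight': most of the cluster's CPU is already allocated.
    if cluster_total == 0 or cluster_used / cluster_total >= scale_up_at:
        return "scale_up", []
    # Otherwise look for slaves that are (almost) idle and could be shut down.
    idle = sorted(sid for sid, cpus in total.items()
                  if cpus and used[sid] / cpus <= idle_below)
    return ("scale_down", idle) if idle else ("hold", [])
```

The state argument would be the JSON document fetched from the master's
state.json endpoint; polling it on a timer and acting on the result is left
to whatever drives the ASG.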


>
>
> -Jason
>
> On May 30, 2014, at 5:57 PM, Diptanu Choudhury <[email protected]> wrote:
>
> Hi,
>
> I am currently working on designing an auto-scaling solution for Mesos
> slaves in AWS and would love to get some feedback around that. There are a
> couple of ways of doing it, and I was thinking of starting with simple cases
> first -
>
> a. Define the lowest resource offer a framework can afford to get and then
> we start using the information published by Mesos master in state.json to
> determine if the cluster has enough resources. If we see that the available
> resources won't satisfy the lower bounds set, we bring up new EC2 instances
> with enough resources that Mesos could use to make offers.
>

Sounds reasonable but I personally wouldn't want to go through the
administrative burden of setting the thresholds per framework. Resource
requirements also change over time.
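
For reference, the check in (a) could look roughly like this. It is only a
sketch: free_by_slave (free resources per slave, derived from state.json) and
the per-framework minimum are hypothetical inputs, and the function name is
made up:

```python
def can_satisfy(free_by_slave, minimum):
    """True if at least one slave has enough free resources for `minimum`.

    free_by_slave: {slave_id: {"cpus": ..., "mem": ...}} (hypothetical shape)
    minimum: the lowest offer a framework can use, e.g. {"cpus": 2, "mem": 1024}
    """
    return any(
        all(free.get(name, 0.0) >= amount for name, amount in minimum.items())
        for free in free_by_slave.values()
    )
```

If this returns False for some framework's lower bound, that would be the
signal to bring up new EC2 instances. The per-framework minimum is exactly
the threshold that somebody has to maintain.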

>
> b. Latency for getting an offer for a given job. Say that the framework
> has a job which needs x cpu, y memory and z ports. If the framework doesn't
> get an offer within t amount of time, the ASG with slaves of an EC2 instance
> type which can offer that amount of resources is autoscaled.
>

Scheduling latency is an interesting metric. Mesos would have to expose
the time between a requestResources() call from the scheduler and a
matching offer being sent. The autoscaling component can then query the
Mesos REST API and scale up/down based on thresholds. It feels like a more
intuitive and easy-to-tweak knob than resource thresholds, and it covers
case (a) as well.
The problem is that afaik not many frameworks use requestResources() at the
moment.
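
To make the knob concrete, a rough sketch. Mesos does not expose this metric
today, so the list of wait times is a hypothetical input, and the function
name and threshold values are invented for illustration:

```python
def latency_decision(waits, scale_up_after=60.0, scale_down_below=5.0):
    """Pick a scaling action from recent scheduling latencies.

    waits: seconds each recent resource request waited for a matching offer
    (hypothetical metric; not something Mesos reports at the moment).
    """
    if not waits:
        return "hold"  # no recent requests, nothing to base a decision on
    worst = max(waits)
    if worst > scale_up_after:
        return "scale_up"  # some job waited longer than t for an offer
    if worst < scale_down_below:
        return "scale_down"  # every request was matched almost immediately
    return "hold"
```

Only two numbers to tune here, in units (seconds) that map directly to what
a job owner experiences.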


>
> c. Maintain historical information about the resources used, jobs
> submitted and running in Mesos and use that information for doing
> Predictive autoscaling.
>

@chris_deli did some related work:
https://www.youtube.com/watch?v=YpmElyi94AA

>
> I would like to understand if there are potentially better ways of
> achieving elasticity in a Mesos cluster, where the complexity lies, and
> what information Mesos could provide us to make it more efficient.
>
> --
> Thanks,
> Diptanu Choudhury
> Web - www.linkedin.com/in/diptanu
> Twitter - @diptanu <http://twitter.com/diptanu>
>
>
