On Fri, May 30, 2014 at 4:40 PM, Jason Giedymin <[email protected]> wrote:
> You would be surprised how far just scaling when resource offers are
> 'tight' and keeping track of idle CPU for each slave to shut them down can
> take you.
>

+1, it's really easy to get cpu/mem/disk usage from state.json and set
thresholds. All you'd need to do is consume that API; no changes to Mesos
are required. Would love to see how well that works.

>
> -Jason
>
> On May 30, 2014, at 5:57 PM, Diptanu Choudhury <[email protected]> wrote:
>
> Hi,
>
> I am currently working on designing an auto-scaling solution for Mesos
> slaves in AWS and would love to get some feedback around that. There are a
> couple of ways of doing it, and I was thinking of starting with the simple
> cases first -
>
> a. Define the lowest resource offer a framework can afford to get, and then
> use the information published by the Mesos master in state.json to
> determine whether the cluster has enough resources. If we see that the
> available resources won't satisfy the lower bounds set, we bring up new EC2
> instances with enough resources that Mesos could use to make offers.
>

Sounds reasonable, but I personally wouldn't want to go through the
administrative burden of setting the thresholds per framework. Resource
requirements also change over time.

> b. Latency for getting an offer for a given job. Say that the framework
> has a job which needs x cpu, y memory and z ports. If the framework doesn't
> get an offer within t amount of time, the ASG with slaves of an EC2 instance
> type which can offer that amount of resources is autoscaled.
>

Scheduling latency is an interesting metric. Mesos would have to expose the
time between a requestResources() call from the scheduler and a matching
offer being sent. The autoscaling component can then query the Mesos REST
API and scale up/down based on thresholds. It feels like a more intuitive,
easier-to-tweak knob than resource thresholds, and it covers case a. as
well. The problem is that, AFAIK, not many frameworks use requestResources()
at the moment.

> c. Maintain historical information about the resources used and the jobs
> submitted and running in Mesos, and use that information for doing
> predictive autoscaling.
>

@chris_deli did some related work:
https://www.youtube.com/watch?v=YpmElyi94AA

> I would like to understand if there are potentially better ways of
> achieving elasticity in a Mesos cluster, where the complexity lies, and
> what information Mesos could provide us to make it more efficient.
>
> --
> Thanks,
> Diptanu Choudhury
> Web - www.linkedin.com/in/diptanu
> Twitter - @diptanu <http://twitter.com/diptanu>
>
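
For what it's worth, here is a rough sketch of the state.json threshold idea from the first reply above: scale up when allocation gets tight, otherwise list slaves with no CPU allocated as shutdown candidates. Treat everything in it as an assumption rather than a reference for the Mesos API: the /master/state.json path, the per-slave "resources" and "used_resources" fields, the 0.8 and 0.05 thresholds, and the helper names are all illustrative, so check them against what your master actually returns. It also looks at allocated resources only, not real CPU utilization on the box.

```python
import json
from urllib.request import urlopen

RESOURCES = ("cpus", "mem", "disk")


def fetch_state(master_url):
    """Fetch and parse the master's state.json (endpoint path is an assumption)."""
    with urlopen(master_url + "/master/state.json", timeout=10) as resp:
        return json.load(resp)


def cluster_utilization(state):
    """Return the allocated fraction of each resource across all slaves.

    Assumes each slave entry has "resources" and "used_resources" dicts.
    """
    total = dict.fromkeys(RESOURCES, 0.0)
    used = dict.fromkeys(RESOURCES, 0.0)
    for slave in state.get("slaves", []):
        for name in RESOURCES:
            total[name] += slave.get("resources", {}).get(name, 0.0)
            used[name] += slave.get("used_resources", {}).get(name, 0.0)
    return {n: (used[n] / total[n] if total[n] else 0.0) for n in RESOURCES}


def idle_slaves(state, max_cpu_fraction=0.05):
    """Return IDs of slaves whose allocated CPU fraction is at or below the limit."""
    idle = []
    for slave in state.get("slaves", []):
        cpus = slave.get("resources", {}).get("cpus", 0.0)
        used = slave.get("used_resources", {}).get("cpus", 0.0)
        if cpus and used / cpus <= max_cpu_fraction:
            idle.append(slave["id"])
    return idle


def decide(state, scale_up_at=0.8):
    """Return (action, slave_ids): scale up if any resource is tight,
    otherwise offer idle slaves as scale-down candidates."""
    utilization = cluster_utilization(state)
    if any(u >= scale_up_at for u in utilization.values()):
        return "scale_up", []
    idle = idle_slaves(state)
    if idle:
        return "scale_down", idle
    return "hold", []
```

A polling loop would call decide(fetch_state("http://your-master:5050")) every minute or so and hand the result to whatever adjusts the ASG. You would also want some hysteresis (for example, requiring the same decision on several consecutive polls) so the cluster doesn't flap.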

