Hello fellow Aurorans,

I'd like to share a proposal doc that lays out a roadmap for bringing new scheduling features to Aurora.
David McLaughlin did a fantastic job of getting the ball rolling with the pluggable scheduling patches he contributed (1), and I'd like to expand upon that work.

The overarching idea of this proposal is that everyone has different scheduling needs, and it would be great to enhance Aurora so that operators can meet organization-specific scheduling needs without imposing them on the rest of the community. The features outlined in this proposal are based upon principles from Fenzo (2), which has enjoyed great success powering Mantis (3) and Titus (4) at Netflix.

Finally, since this proposal is about scheduling enhancements, I thought it would be pertinent to include a feature that attempts to avoid hosting tasks on misbehaving agents. Some of the scheduling policies introduced by this proposal can amplify the negative effect a bad node has on performance (i.e., we keep choosing the "bad" node to schedule on, and the task keeps failing through no fault of its own).

I would love to hear feedback on these ideas and opinions on what the next steps should be if we were to embark on this journey.

https://docs.google.com/document/d/11ArMA53chtK-Zb_KPMV7l_bCvTrUb005XlqzGQ2fTP4/edit#

Thanks!

-Renan

1. https://lists.apache.org/thread.html/50caf01283144ee9dacd24d3fb481a2ca6120ceaa1289fd5b48620a4@%3Cdev.aurora.apache.org%3E
2. https://github.com/Netflix/Fenzo
3. https://medium.com/netflix-techblog/stream-processing-with-mantis-78af913f51a6
4. https://medium.com/netflix-techblog/the-evolution-of-container-usage-at-netflix-3abfc096781b