This sounds like a feature request for marathon. Can you redirect this to the marathon mailing list?

On Fri, Apr 29, 2016 at 9:26 AM, Stephen Gran <[email protected]> wrote:

> Hello,
>
> We're running tasks on mesos, launched with marathon. We label all the
> agents with AWS availability zone and VPC name, so that tasks can be
> scheduled to the right set of hosts.
>
> I've noticed something that feels like, well, maybe not a bug, but
> unexpected behavior.
>
> We launch tasks with:
>
>   "constraints": [
>     [
>       "az",
>       "GROUP_BY",
>       "3"
>     ],
>   ],
>   "instances": 2,
>
> this is eu-west-1, where there are 3 AZs. We run agents in all 3 AZs.
>
> On trying to restart an application, no new task was started. Digging
> around, I could see marathon decline any offers from mesos, which led us
> to look a little closer. It turned out that the 2 tasks in the
> application were running in eu-west-1a and eu-west-1b. All the agents
> in eu-west-1c were fully subscribed and could not pick up any new work.
>
> Once we figured this out, it was straight forward enough to rebalance
> and let things sort themselves out.
>
> So, with that as background:
>
> It would have been nicer if marathon had realized that the state at the
> start and the end of the transaction would be to run in only 2 of 3 AZs,
> and allowed a new task to start in either eu-west-1a or eu-west-1b. I
> can see how that might be slightly harder to account for than just even
> stacking.
>
> It would be nice if a metric "a framework keeps asking for resource and
> then declining offers" was available - it may already be, but I can't
> find it. This would at least make the issue visible.
>
> I can see the metric for declined offers, but this also increments when
> the framework declines offers because it doesn't need any additional
> resource, so I'm not sure if it's helpful or not here. Perhaps I need
> to look at a second order derivative to see spikes in declines? It does
> look like the number of declines went way up during this period.
>
> Like I said, I don't know if this is a bug, precisely, but it was a not
> very visible failure to use resource, when there were actually plenty of
> resources on offer. I'd like to make these failures more visible to the
> team, so any pointers would be helpful.
>
> Cheers,
>
> --
> Stephen Gran
> Senior Technical Architect
>
> picture the possibilities | piksel.com
>
> This message is private and confidential. If you have received this
> message in error, please notify the sender or [email protected] and
> remove it from your system.
>
> Piksel Inc is a company registered in the United States New York City,
> 1250 Broadway, Suite 1902, New York, NY 10001. F No. = 2931986
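
On the question of spotting spikes in declines: one possible approach, sketched below, is to turn the cumulative declined-offers counter into a per-second rate and flag intervals where that rate jumps well above its earlier baseline. This is only an illustration. The sample format, the function names and the threshold factor are assumptions made for the sketch, not anything Mesos or Marathon provides, and the numbers in the usage note are made up.

```python
import statistics
from typing import List, Tuple

# Hypothetical sample format: (unix timestamp in seconds, cumulative declined-offer count).
Sample = Tuple[float, int]


def decline_rates(samples: List[Sample]) -> List[float]:
    """Per-second decline rate between consecutive samples of a cumulative counter."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or c1 < c0:
            # Skip out-of-order timestamps and counter resets (e.g. after a restart).
            continue
        rates.append((c1 - c0) / dt)
    return rates


def find_spikes(rates: List[float], factor: float = 3.0) -> List[int]:
    """Indices of rates that exceed `factor` times the median of all earlier rates."""
    spikes = []
    for i in range(1, len(rates)):
        baseline = statistics.median(rates[:i])
        if baseline > 0 and rates[i] > factor * baseline:
            spikes.append(i)
    return spikes
```

With made-up samples taken once a minute, `[(0, 0), (60, 60), (120, 120), (180, 180), (240, 780), (300, 1380)]`, the rates are one decline per second for the first three intervals and ten per second for the last two, so `find_spikes` flags the last two intervals. A rate spike alone still cannot tell "declining because nothing is needed" apart from "declining because a constraint cannot be met", so it would probably have to be combined with some signal that the application still has instances waiting to launch.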

