Hello,

We're running tasks on mesos, launched with marathon. We label all the agents with AWS availability zone and VPC name, so that tasks can be scheduled to the right set of hosts.

I've noticed something that feels like, well, maybe not a bug, but unexpected behavior. We launch tasks with:

    "constraints": [
        [ "az", "GROUP_BY", "3" ]
    ],
    "instances": 2,

This is in eu-west-1, where there are 3 AZs, and we run agents in all 3 of them.

When we tried to restart an application, no new task was started. Digging around, I could see marathon declining every offer from mesos, which led us to look a little closer. It turned out that the 2 tasks in the application were running in eu-west-1a and eu-west-1b, and all the agents in eu-west-1c were fully subscribed and could not pick up any new work. Once we figured this out, it was straightforward enough to rebalance and let things sort themselves out.

So, with that as background:

It would have been nicer if marathon had realized that the application would be running in only 2 of the 3 AZs both before and after the restart, and had allowed a new task to start in either eu-west-1a or eu-west-1b. I can see how that might be slightly harder to account for than just even stacking.

It would also be nice if a metric for "a framework keeps asking for resources and then declining offers" was available. It may already exist, but I can't find it. This would at least make the issue visible. I can see the metric for declined offers, but it also increments when the framework declines offers because it doesn't need any additional resources, so I'm not sure whether it's helpful here. Perhaps I need to look at the rate of change of that counter to see spikes in declines? It does look like the number of declines went way up during this period.

Like I said, I don't know if this is a bug, precisely, but it was a not very visible failure to use resources when there were actually plenty on offer. I'd like to make these failures more visible to the team, so any pointers would be helpful.

Cheers,
--
Stephen Gran
Senior Technical Architect
picture the possibilities | piksel.com
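
P.S. To make the "spikes in declines" idea concrete, here is a rough sketch of the kind of check I have in mind. It assumes we already collect (timestamp, cumulative declined-offer count) samples from somewhere; the function names, the sample data and the 5x threshold are all made up for illustration and are not part of mesos or marathon. It also can't tell "declined because of a constraint" apart from "declined because nothing is needed", so at best it is a hint to go and look.

```python
from statistics import median


def decline_rates(samples):
    """Turn cumulative counter samples into per-second decline rates.

    samples: list of (timestamp_seconds, cumulative_declined_count),
    ordered by time. This is a hypothetical input format.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or c1 < c0:
            # Skip out-of-order samples and counter resets.
            continue
        rates.append((t1, (c1 - c0) / dt))
    return rates


def find_spikes(rates, factor=5.0):
    """Return the (timestamp, rate) pairs well above the median rate.

    factor is an arbitrary, illustrative threshold multiplier.
    """
    if not rates:
        return []
    baseline = median(r for _, r in rates)
    return [(t, r) for t, r in rates if r > 0 and r > baseline * factor]


if __name__ == "__main__":
    # Invented samples: a steady trickle of declines, then a burst.
    samples = [(0, 0), (60, 6), (120, 12), (180, 18),
               (240, 318), (300, 618), (360, 624)]
    for t, r in find_spikes(decline_rates(samples)):
        print(f"decline spike at t={t}s: {r:.2f} declines/s")
```

Comparing against the median rather than the mean keeps a short burst from dragging the baseline up with it, but any threshold here would need tuning against real data.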