This sounds like a feature request for marathon. Can you redirect this to
the marathon mailing list?

On Fri, Apr 29, 2016 at 9:26 AM, Stephen Gran <[email protected]>
wrote:

> Hello,
>
> We're running tasks on mesos, launched with marathon.  We label all the
> agents with AWS availability zone and VPC name, so that tasks can be
> scheduled to the right set of hosts.
>
> I've noticed something that feels like, well, maybe not a bug, but
> unexpected behavior.
>
> We launch tasks with:
>
>     "constraints": [
>         [
>             "az",
>             "GROUP_BY",
>             "3"
>         ]
>     ],
>     "instances": 2,
>
> This is in eu-west-1, which has 3 AZs.  We run agents in all 3 AZs.
>
> On trying to restart an application, no new task was started.  Digging
> around, I could see marathon decline any offers from mesos, which led us
> to look a little closer.  It turned out that the 2 tasks in the
> application were running in eu-west-1a and eu-west-1b.  All the agents
> in eu-west-1c were fully subscribed and could not pick up any new work.
>
> Once we figured this out, it was straightforward enough to rebalance
> and let things sort themselves out.
>
> So, with that as background:
>
> It would have been nicer if marathon had realized that the state at the
> start and the end of the transaction would be to run in only 2 of 3 AZs,
> and allowed a new task to start in either eu-west-1a or eu-west-1b.  I
> can see how that might be slightly harder to account for than just even
> stacking.
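
For illustration, here is a simplified model of the even-spread check
described above. It is only a sketch and not Marathon's actual code: the
function name accepts_offer and the rule "accept an offer only if its
group currently holds no more tasks than the emptiest group, counting
unseen groups as empty" are assumptions chosen to reproduce the
behaviour reported here.

```python
from collections import Counter


def accepts_offer(running_azs, offer_az, expected_groups):
    """Hypothetical even-spread rule, NOT Marathon's implementation.

    running_azs: AZ label of each task that is currently running.
    offer_az: AZ label of the agent making the offer.
    expected_groups: the GROUP_BY parameter (3 in the example above).
    """
    counts = Counter(running_azs)
    # If fewer groups are in use than expected, at least one group is
    # still empty, so the smallest group size is treated as 0.
    if len(counts) < expected_groups:
        minimum = 0
    else:
        minimum = min(counts.values())
    # Accept only offers from a group that is no fuller than the minimum.
    return counts[offer_az] <= minimum


# The situation from the report: tasks in 1a and 1b, none in 1c.
running = ["eu-west-1a", "eu-west-1b"]
decisions = {
    az: accepts_offer(running, az, 3)
    for az in ("eu-west-1a", "eu-west-1b", "eu-west-1c")
}
print(decisions)
```

Under this model only eu-west-1c offers are acceptable, so if the agents
there have no spare capacity, every usable offer is declined.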
>
> It would be nice if a metric "a framework keeps asking for resource and
> then declining offers" was available - it may already be, but I can't
> find it.  This would at least make the issue visible.
>
> I can see the metric for declined offers, but this also increments when
> the framework declines offers because it doesn't need any additional
> resource, so I'm not sure if it's helpful or not here.  Perhaps I need
> to look at a second order derivative to see spikes in declines?  It does
> look like the number of declines went way up during this period.
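
A rough sketch of the spike detection suggested above, assuming you
already collect (timestamp, counter value) samples of the cumulative
declined-offers counter. The function names, the factor of 3 and the
sample numbers are all made up for illustration. Note that the rate of
change of the counter (its first difference) may already be enough to
show the spike.

```python
from statistics import median


def decline_rates(samples):
    """Turn cumulative (timestamp_seconds, count) samples into
    (timestamp, declines_per_second) pairs, one per interval."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        rates.append((t1, (c1 - c0) / (t1 - t0)))
    return rates


def spikes(rates, factor=3.0):
    """Return intervals whose rate exceeds factor times the median rate.
    The threshold is an arbitrary assumption, to be tuned per cluster."""
    baseline = median(r for _, r in rates)
    return [(t, r) for t, r in rates if r > factor * baseline]


# Made-up samples taken every 60 seconds; the counter jumps mid-series.
samples = [(0, 100), (60, 160), (120, 220), (180, 280),
           (240, 1480), (300, 2680), (360, 2740)]
print(spikes(decline_rates(samples)))
```

This still cannot tell "declined because nothing is needed" apart from
"declined because no offer satisfies the constraint"; it only makes a
sudden change in decline volume easier to see.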
>
> Like I said, I don't know if this is a bug, precisely, but it was a
> not-very-visible failure to use resources, even though plenty were
> actually on offer.  I'd like to make these failures more visible to
> the team, so any pointers would be helpful.
>
> Cheers,
>
> --
> Stephen Gran
> Senior Technical Architect
>
> picture the possibilities | piksel.com
>
> This message is private and confidential. If you have received this
> message in error, please notify the sender or [email protected] and
> remove it from your system.
>
> Piksel Inc is a company registered in the United States New York City,
> 1250 Broadway, Suite 1902, New York, NY 10001. F No. = 2931986
>
