Hello,

We're running tasks on Mesos, launched with Marathon.  We label all the
agents with their AWS availability zone and VPC name, so that tasks can
be scheduled onto the right set of hosts.

I've noticed something that feels like, well, maybe not a bug, but
unexpected behavior.

We launch tasks with:

    "constraints": [
        [
            "az",
            "GROUP_BY",
            "3"
        ]
    ],
    "instances": 2,

This is in eu-west-1, which has 3 AZs.  We run agents in all 3 AZs.

When we tried to restart an application, no new task was started.
Digging around, I could see Marathon declining every offer from Mesos,
which led us to look a little closer.  It turned out that the 2 tasks in
the application were running in eu-west-1a and eu-west-1b, and all the
agents in eu-west-1c were fully subscribed and could not pick up any new
work.

Once we figured this out, it was straightforward enough to rebalance
and let things sort themselves out.

So, with that as background:

It would have been nicer if Marathon had realized that the application
would be running in only 2 of 3 AZs both before and after the restart,
and had allowed a new task to start in either eu-west-1a or eu-west-1b.
I can see how that might be slightly harder to account for than just
even stacking.
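To make the behavior I'm describing concrete, here is a rough Python
sketch.  It is a simplified model of a GROUP_BY-style check, not
Marathon's actual code; the function names and the azs_with_capacity
argument are made up for illustration.

```python
from collections import Counter

def accepts_offer(offer_az, running_azs, min_groups):
    """Simplified GROUP_BY-style check (an assumption, not Marathon's code)."""
    counts = Counter(running_azs)
    if len(counts) < min_groups:
        # Fewer groups in use than requested: only an unused group qualifies.
        return offer_az not in counts
    # Enough groups in use: only the least-loaded groups qualify.
    return counts.get(offer_az, 0) <= min(counts.values(), default=0)

def accepts_offer_relaxed(offer_az, running_azs, min_groups, azs_with_capacity):
    """Hypothetical relaxation: fall back to the least-loaded used group
    when no unused group has room for a task."""
    if accepts_offer(offer_az, running_azs, min_groups):
        return True
    counts = Counter(running_azs)
    if any(az not in counts for az in azs_with_capacity):
        # An unused group could still take the task, so keep waiting for it.
        return False
    return counts.get(offer_az, 0) <= min(counts.values(), default=0)

running = ["eu-west-1a", "eu-west-1b"]
strict_1a = accepts_offer("eu-west-1a", running, 3)
strict_1c = accepts_offer("eu-west-1c", running, 3)
relaxed_1a = accepts_offer_relaxed(
    "eu-west-1a", running, 3, azs_with_capacity={"eu-west-1a", "eu-west-1b"}
)
```

In this model the strict check turns down eu-west-1a and eu-west-1b and
only accepts eu-west-1c, which matches what we saw.  The relaxed variant
accepts eu-west-1a once it knows eu-west-1c has no room.  A real
scheduler only sees the offers it is sent, so knowing which AZs have
capacity is the hard part.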

It would be nice if there were a metric for "a framework keeps asking
for resources and then declining offers".  It may already exist, but I
can't find it.  This would at least make the issue visible.

I can see the metric for declined offers, but this also increments when
the framework declines offers because it doesn't need any additional
resources, so I'm not sure whether it's helpful here.  Perhaps I need
to look at a second-order derivative to see spikes in declines?  It does
look like the number of declines went way up during this period.
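As a rough idea of what I mean by watching for spikes, here is a small
Python sketch.  It assumes I can sample the cumulative declined-offers
counter at regular intervals; the sample format, the function names and
the factor of 5 are my own assumptions, not anything Mesos or Marathon
provides.

```python
from statistics import median

def decline_rates(samples):
    """samples: time-ordered (unix_seconds, cumulative_declined_count) pairs.
    Returns (timestamp, declines_per_second) for each interval."""
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0 or c1 < c0:
            # Skip out-of-order samples and counter resets.
            continue
        rates.append((t1, (c1 - c0) / dt))
    return rates

def spikes(rates, factor=5.0):
    """Flag intervals whose decline rate is well above the median rate."""
    if not rates:
        return []
    baseline = median(r for _, r in rates)
    return [(t, r) for t, r in rates if r > 0 and r > factor * baseline]

# Made-up samples: about 1 decline/s, then a burst in the last minute.
samples = [(0, 0), (60, 60), (120, 120), (180, 180), (240, 1380)]
flagged = spikes(decline_rates(samples))
```

With these invented numbers only the last interval is flagged.  This
still cannot tell "declined because nothing was needed" apart from
"declined because of a constraint", so it would probably need to be
combined with whether a deployment is pending.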

Like I said, I don't know if this is a bug, precisely, but it was a
hard-to-see failure to use resources when there were actually plenty on
offer.  I'd like to make these failures more visible to the team, so any
pointers would be helpful.

Cheers,

--
Stephen Gran
Senior Technical Architect
