Hi Mohit,

The scenario makes sense and unfortunately this is really a bug in how we
do allocations.

The default allocator in Mesos implements a weighted fair-sharing algorithm
called dominant resource fairness. This does well when there are tasks that
are "short-lived" (or at least, don't live forever) and an adequate amount
of resources to "go around". In your case, it sounds like there is a lot
more computation (build jobs) than available resources.
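
To make the fair-sharing idea concrete, here is a minimal sketch of how a
dominant-resource-fairness ordering can be computed. It is only an
illustration, not the actual Mesos allocator code: the function names, the
framework names, the weights argument, and the cluster totals are all
hypothetical. A framework's dominant share is the largest fraction it holds
of any single resource, and the framework with the smallest (weighted)
dominant share is the next one to be offered resources.

```python
def dominant_share(allocated, total):
    """Return the largest fraction of any one resource that is allocated."""
    return max(allocated.get(r, 0.0) / total[r] for r in total)


def next_framework(allocations, total, weights=None):
    """Pick the framework with the smallest weighted dominant share.

    Ties are broken by framework name so the result is deterministic.
    """
    weights = weights or {}
    return min(
        sorted(allocations),
        key=lambda fw: dominant_share(allocations[fw], total)
        / weights.get(fw, 1.0),
    )


# Hypothetical cluster totals and per-framework allocations.
total = {"cpus": 24.0, "mem": 96.0}
allocations = {
    "marathon": {"cpus": 12.0, "mem": 24.0},   # dominant resource: cpus
    "jenkins-1": {"cpus": 2.0, "mem": 48.0},   # dominant resource: mem
    "jenkins-2": {"cpus": 0.0, "mem": 0.0},    # holds nothing yet
}

print(next_framework(allocations, total))
```

In this sketch the framework that currently holds nothing sorts first, which
is the behavior you would expect to see for the starved Jenkins masters.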

I've filed https://issues.apache.org/jira/browse/MESOS-1086 and put up a
patch at https://reviews.apache.org/r/19090.

Thanks for the detailed report, Mohit!

Ben.


On Mon, Mar 10, 2014 at 7:58 PM, Mohit Soni <mohitsoni1...@gmail.com> wrote:

> I was running a load test on a mesos-cluster, and observed that when mesos
> is running lots of frameworks, offer starvation occurs for certain
> frameworks, i.e. only a subset of frameworks registered with mesos gets
> offers. Let me describe the scenario below:
>
> First phase:
> At the beginning, there's only one framework registered with mesos, which
> is 'Marathon'. The load generator uses Marathon's API to launch, let's say,
> 50 Jenkins masters with mesos-plugin installed. Once all 50 masters are
> launched, the mesos-cluster has 51 frameworks registered in total,
> because the mesos-plugin registers itself with mesos-master as a framework.
>
> Second phase:
> Now, the load generator triggers a couple of build jobs on each
> Jenkins master. Each framework's scheduler will now have, let's say, 2 items
> in its build queue. Once a framework gets a resource offer from the master,
> its scheduler can perform the build tasks, if the offer matches the
> resource constraints as specified by mesos-plugin.
>
> What I observed was that, at the start of the second phase, some frameworks
> (Jenkins masters) got offers and got their tasks scheduled to run. But the
> rest of the frameworks didn't get resource offers from mesos-master, and
> the build jobs scheduled on those got starved. Tailing the Jenkins logs on
> these masters never showed 'Received offers'. Also, according to the mesos
> master logs, mesos was sending offers to only a handful of frameworks. The
> logs below show the messages from one minute, but I saw similar behavior
> at other times. I have added a line break after each group of frameworks
> getting offers:
>
> I0310 17:56:44.703126  1156 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0364
> I0310 17:56:45.722951  1156 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0371
> I0310 17:56:46.744184  1159 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0377
> I0310 17:56:47.768546  1158 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0380
> I0310 17:56:48.794517  1156 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0396
>
> I0310 17:56:49.813484  1157 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0364
> I0310 17:56:50.833155  1159 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0371
> I0310 17:56:51.859712  1158 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0377
> I0310 17:56:52.879678  1153 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0380
> I0310 17:56:53.904261  1156 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0396
>
> I0310 17:56:54.929472  1155 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0364
> I0310 17:56:55.947387  1153 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0371
> I0310 17:56:56.975060  1157 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0377
> I0310 17:56:57.996995  1159 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0380
> I0310 17:56:59.022555  1156 master.cpp:2250] Sending 24 offers to
> framework 201403032301-1255541002-5050-1126-0396
>
> A couple of questions:
> 1. Does running multiple frameworks (say more than 10) have an impact on
> the resource allocation strategy?
> 2. If a registered framework keeps declining mesos offers for a while,
> does mesos take that into account while sending offers?
>
> Links:
> 1. https://github.com/mesosphere/marathon
> 2. https://github.com/jenkinsci/mesos-plugin
>
> --
> Mohit
>
