large task scheduling on multi-framework cluster

Grégoire Seux Tue, 01 Oct 2019 05:18:05 -0700

Hello,

I'm wondering how other mesos users deal with scheduling of large tasks (using 
all resources offered by most agents).

On our cluster, we have various applications launched mainly by marathon. Some
of those applications have large instances (30 cpus) which use all resources
from agents (most of our agents expose 30 cpus to mesos). Beyond these large
applications (many instances, many resources per instance) we have a lot more
applications whose instances are of various sizes (from 1 to 10 cpus).

Our issue lies with scheduling: marathon uses offers from mesos as they come,
which creates fragmentation. Most agents have small tasks running, which
prevents big tasks from being scheduled. In an ideal world, mesos (or marathon)
would make sure some apps (let's say frameworks if mesos takes that
responsibility) have guarantees on large offers. We also have non-marathon
in-house frameworks which have similar needs to launch large tasks.
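
To make the fragmentation concrete, here is a toy model (not mesos or marathon
code; every name in it is made up for illustration). It assumes 4 agents of 30
cpus each and four small 2-cpu tasks, and compares spreading them across agents
with packing them onto the first agent that has room:

```python
# Toy model of fragmentation. Not Mesos or Marathon code: all names are
# hypothetical and the placement policies are deliberately simplistic.
AGENT_CPUS = 30  # cpus exposed by each agent, as in the cluster described


def place_spread(num_agents, small_tasks):
    """Place each small task on the agent with the most free cpus."""
    free = [AGENT_CPUS] * num_agents
    for cpus in small_tasks:
        target = max(range(num_agents), key=lambda a: free[a])
        free[target] -= cpus
    return free


def place_packed(num_agents, small_tasks):
    """First-fit: place each small task on the first agent with room."""
    free = [AGENT_CPUS] * num_agents
    for cpus in small_tasks:
        for agent in range(num_agents):
            if free[agent] >= cpus:
                free[agent] -= cpus
                break
    return free


def large_slots(free):
    """Count agents that can still host a full 30-cpu instance."""
    return sum(1 for f in free if f >= AGENT_CPUS)


small = [2] * 4
spread = place_spread(4, small)
packed = place_packed(4, small)
print("spread:", spread, "large slots:", large_slots(spread))
print("packed:", packed, "large slots:", large_slots(packed))
```

With spreading, every agent ends up with 28 free cpus and no 30-cpu instance
fits anywhere, although only 8 of 120 cpus are used. With packing, three agents
stay entirely free.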

Our current solution is to:

* use a dedicated marathon instance (and a dedicated role) for those big
applications
* dedicate agents to this role

Of course, this requires extra work since our mesos clusters are now sharded (it
creates additional toil in terms of maintenance & capacity planning).

Our thinking is that the mesos allocator might be improved to distribute offers
with a better heuristic than the current one (offers are randomly sorted).
Somewhat like what was suggested in
http://mail-archives.apache.org/mod_mbox/mesos-user/201906.mbox/%3cCAHReGaiY0nJ0AevMvKbxAZsy2Xc=jmtszcucdxryzbvwkvv...@mail.gmail.com%3e,
we could imagine sorting offers (offers from the most used agents first).
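
As a rough sketch of that heuristic (this is not the mesos allocator, and the
Offer class, its fields, and the function names are assumptions made up for the
example), offers could be ordered by how used their agent already is, so that
small tasks fill busy agents before touching empty ones:

```python
# Sketch of a "most used agents first" ordering. This is NOT the Mesos
# allocator API: Offer, its fields, and both functions are hypothetical.
from dataclasses import dataclass


@dataclass
class Offer:
    agent: str
    total_cpus: float  # cpus the agent exposes
    free_cpus: float   # cpus available in this offer

    @property
    def used_fraction(self):
        return 1.0 - self.free_cpus / self.total_cpus


def sort_most_used_first(offers):
    """Order offers so that the busiest agents come first."""
    return sorted(offers, key=lambda o: o.used_fraction, reverse=True)


def pick_offer(offers, cpus_needed):
    """Return the first offer, busiest agent first, that fits the task."""
    for offer in sort_most_used_first(offers):
        if offer.free_cpus >= cpus_needed:
            return offer
    return None


offers = [
    Offer("agent-a", 30, 30),  # empty agent
    Offer("agent-b", 30, 8),
    Offer("agent-c", 30, 3),   # busiest agent
]
small = pick_offer(offers, 4)   # agent-c is too full, so agent-b is chosen
large = pick_offer(offers, 30)  # only the empty agent-a can host this
print(small.agent, large.agent)
```

A 4-cpu task lands on the already busy agent-b, which leaves agent-a whole for
a 30-cpu instance.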

So I'm curious about how other users handle this kind of need!

Regards,

--
Grégoire Seux
