I'm trying to understand the behavior of Mesos: whether what I am observing
is typical or I'm doing something wrong, and what options I have for
improving how quickly offers are made and tasks are executed for my
particular use case.

I have written a Scheduler that has a queue of very small tasks (for
testing, they are "echo hello world", but in production many of them won't
be much more expensive than that). Each task is configured to use 1 cpu
resource. When resourceOffers is called, I launch as many tasks as I can in
the given offers; that is, one call to driver.launchTasks for each offer,
with a list of tasks that has one task for each cpu in that offer.
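
To make the packing logic concrete, here is a minimal sketch in plain Python
with no Mesos bindings involved. The function name assign_tasks, the
(offer_id, cpus) tuples, and the task names are hypothetical stand-ins for
illustration, not the real scheduler API; each returned batch corresponds to
what would go into one driver.launchTasks call.

```python
from collections import deque


def assign_tasks(offers, queue):
    """Map each offer id to the tasks to launch on it, one task per cpu."""
    launches = {}
    for offer_id, cpus in offers:
        batch = []
        # Take at most one queued task per whole cpu in this offer.
        for _ in range(int(cpus)):
            if not queue:
                break
            batch.append(queue.popleft())
        # One batch per offer, i.e. one launch call per offer.
        launches[offer_id] = batch
    return launches


# Hypothetical example: 10 queued tasks, 3 offers of 4 cpus each.
queue = deque(f"task-{i}" for i in range(10))
offers = [("offer-a", 4), ("offer-b", 4), ("offer-c", 4)]
launches = assign_tasks(offers, queue)
```

With 10 queued tasks and 12 offered cpus, the first two offers get 4 tasks
each and the last one gets the remaining 2.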

On a cluster of 3 nodes with 4 cores each (12 cores total), it takes 120s to
execute 1000 tasks out of the queue. We are evaluating Mesos because we want
to use it to replace our current homegrown cluster controller, which can
execute 1000 tasks in far less than 120s.

I am seeing two things that concern me:

   - The time between driver.launchTasks and receiving a callback to
   statusUpdate when the task completes is typically 200-500ms, and sometimes
   even as high as 1000-2000ms.
   - The time between when a task completes and when I get an offer for the
   newly freed resource is another 500ms or so.

These latencies explain why I can only execute tasks at a rate of about 8/s.
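
As a back-of-the-envelope check, the arithmetic can be sketched as below. It
assumes each core runs one task at a time and that all 12 cores are cycling
through tasks continuously, which is a simplification; the variable names are
only for illustration.

```python
tasks = 1000        # tasks executed in the test run
elapsed_s = 120.0   # wall-clock time for the run, in seconds
cores = 12          # 3 nodes x 4 cores

# Observed throughput across the whole cluster.
rate = tasks / elapsed_s

# Implied time each core spends per task (launch, run, wait for re-offer).
cycle_s = cores / rate

print(f"{rate:.2f} tasks/s, {cycle_s:.2f} s per task per core")
```

Under those assumptions the implied cycle is about 1.44s per task per core,
which falls inside the range given by adding the two latencies above (roughly
0.7s in the typical case, up to about 2.5s in the worst case).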

It looks like my offers always include all 4 cores on each machine, which
would indicate that Mesos doesn't send an offer as soon as a single
resource is available, and prefers to delay and send an offer with more
resources in it. Is this true?

Thanks in advance for any advice you can offer!

- Phllip
