As you noticed, recovered resources are currently not re-offered
immediately, and the default allocation interval is 1 second. I'd recommend
lowering it (e.g. --allocation_interval=50ms), which should help with the
second bullet you listed. In your case, though, it would be better to
re-offer recovered resources immediately (feel free to file a ticket for
supporting that).
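
To put rough numbers on why the allocation interval matters, here is a
back-of-the-envelope sketch in Python. It is only a model, not a measurement:
the function names are made up for illustration, it assumes each core runs
one task per "launch, finish, wait for re-offer" cycle, and it assumes a
freed resource waits half the allocation interval on average. The 0.35s
launch-to-status latency is just the midpoint of the 200-500ms range you
reported.

```python
def tasks_per_second(cores, launch_to_status_s, reoffer_wait_s):
    """Steady-state rate if every core completes one task per cycle."""
    cycle_s = launch_to_status_s + reoffer_wait_s
    return cores / cycle_s


def mean_reoffer_wait_s(allocation_interval_s):
    """Assumed average wait: resources free up at a uniformly random point
    within the interval, so they wait half of it on average."""
    return allocation_interval_s / 2.0


CORES = 12                    # 3 nodes x 4 cores, from the report
LAUNCH_TO_STATUS_S = 0.35     # assumed midpoint of the reported 200-500ms

# Per-core cycle time implied by the observed 1000 tasks in 120s.
observed_rate = 1000 / 120.0
implied_cycle_s = CORES / observed_rate

if __name__ == "__main__":
    print(f"observed: {observed_rate:.1f} tasks/s, "
          f"implied cycle {implied_cycle_s:.2f}s per core")
    for interval_s in (1.0, 0.05):  # default vs. suggested interval
        wait_s = mean_reoffer_wait_s(interval_s)
        rate = tasks_per_second(CORES, LAUNCH_TO_STATUS_S, wait_s)
        print(f"interval {interval_s}s: ~{rate:.1f} tasks/s")
```

Under these assumptions the re-offer wait drops from about 500ms (which
matches what you measured) to about 25ms, so the launch-to-status latency in
your first bullet becomes the dominant cost per task.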

For the first bullet, would you mind providing some more information, e.g.
master flags, slave flags, and the scheduler, master, slave, and executor
logs? We would need to trace through a task launch to see where the latency
is being introduced.

On Fri, Jul 17, 2015 at 12:26 PM, Philip Weaver <[email protected]>
wrote:

> I'm trying to understand the behavior of mesos, and if what I am observing
> is typical or if I'm doing something wrong, and what options I have for
> improving the performance of how offers are made and how tasks are executed
> for my particular use case.
>
> I have written a Scheduler that has a queue of very small tasks (for
> testing, they are "echo hello world", but in production many of them won't
> be much more expensive than that). Each task is configured to use 1 cpu
> resource. When resourceOffers is called, I launch as many tasks as I can in
> the given offers; that is, one call to driver.launchTasks for each offer,
> with a list of tasks that has one task for each cpu in that offer.
>
> On a cluster of 3 nodes and 4 cores each (12 total cores), it takes 120s
> to execute 1000 tasks out of the queue. We are evaluating mesos because we
> want to use it to replace our current homegrown cluster controller, which
> can execute 1000 tasks in way less than 120s.
>
> I am seeing two things that concern me:
>
>    - The time between driver.launchTasks and receiving a callback to
>    statusUpdate when the task completes is typically 200-500ms, and sometimes
>    even as high as 1000-2000ms.
>    - The time between when a task completes and when I get an offer for
>    the newly freed resource is another 500ms or so.
>
> These latencies explain why I can only execute tasks at a rate of about
> 8/s.
>
> It looks like my offers always include all 4 cores on each machine, which
> would indicate that mesos doesn't like to send an offer as soon as a single
> resource is available, and prefers to delay and send an offer with more
> resources in it. Is this true?
>
> Thanks in advance for any advice you can offer!
>
> - Philip
>
>
