Ok, thanks!

On Fri, Jul 17, 2015 at 2:18 PM, Alexander Gallego <[email protected]>
wrote:

> I use a similar pattern.
>
> I have my own scheduler as you have. I deploy my own executor which
> downloads a tar from some storage and effectively `execvp(...)`s a
> proc. It monitors the child proc and reports the child's exit
> status.
>
> Check out the Marathon code if you are writing in Scala. It is an
> excellent example of both scheduler and executor templates.
>
> -ag
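The run-and-report pattern Alexander describes can be sketched in plain Scala using only the standard library's process support. This is an illustrative stand-in, not a real Mesos executor: a real one would implement `org.apache.mesos.Executor` and report state via `driver.sendStatusUpdate`; the `ChildRunner` name and its methods are hypothetical.

```scala
import scala.sys.process._

// Hypothetical sketch of the executor pattern described above: exec a
// child process (after fetching its payload), wait for it, and report
// the exit status. In a real Mesos executor the exit code would be
// translated into TASK_FINISHED / TASK_FAILED via sendStatusUpdate.
object ChildRunner {
  // Run a command, block until the child exits, return its exit code.
  def runAndReport(cmd: Seq[String]): Int = {
    val exitCode = Process(cmd).!  // '!' blocks and returns the exit code
    exitCode
  }

  def main(args: Array[String]): Unit = {
    val status = runAndReport(Seq("echo", "hello", "world"))
    println(s"child exited with status $status")
  }
}
```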
>
> On Fri, Jul 17, 2015 at 5:06 PM, Philip Weaver <[email protected]>
> wrote:
>
>> Awesome, I suspected that was the case, but hadn't discovered the
>> --allocation_interval flag, so I will use that.
>>
>> I installed from the mesosphere RPMs and didn't change any flags from
>> there. I will try to find some logs that provide some insight into the
>> execution times.
>>
>> I am using a command task. I haven't looked into executors yet; I had a
>> hard time finding some examples in my language (Scala).
>>
>> On Fri, Jul 17, 2015 at 2:00 PM, Benjamin Mahler <
>> [email protected]> wrote:
>>
>>> One other thing, do you use an executor to run many tasks? Or are you
>>> using a command task?
>>>
>>> On Fri, Jul 17, 2015 at 1:54 PM, Benjamin Mahler <
>>> [email protected]> wrote:
>>>
>>>> Currently, recovered resources are not immediately re-offered, as you
>>>> noticed, and the default allocation interval is 1 second. I'd recommend
>>>> lowering that (e.g. --allocation_interval=50ms), which should improve the
>>>> second bullet you listed. That said, in your case it would be better to
>>>> immediately re-offer recovered resources (feel free to file a ticket for
>>>> supporting that).
>>>>
>>>> For the first bullet, mind providing some more information? E.g. master
>>>> flags, slave flags, scheduler logs, master logs, slave logs, executor logs?
>>>> We would need to trace through a task launch to see where the latency is
>>>> being introduced.
>>>>
>>>> On Fri, Jul 17, 2015 at 12:26 PM, Philip Weaver <
>>>> [email protected]> wrote:
>>>>
>>>>> I'm trying to understand the behavior of mesos: whether what I am
>>>>> observing is typical or I'm doing something wrong, and what options I
>>>>> have for improving how offers are made and how tasks are
>>>>> executed for my particular use case.
>>>>>
>>>>> I have written a Scheduler that has a queue of very small tasks (for
>>>>> testing, they are "echo hello world", but in production many of them won't
>>>>> be much more expensive than that). Each task is configured to use 1 cpu
>>>>> resource. When resourceOffers is called, I launch as many tasks as I can 
>>>>> in
>>>>> the given offers; that is, one call to driver.launchTasks for each offer,
>>>>> with a list of tasks that has one task for each cpu in that offer.
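The launch strategy Philip describes (one `driver.launchTasks` per offer, one task per CPU in that offer) can be sketched as a small packing step. `Offer` and the string task names here are simplified stand-ins for the Mesos protobufs (`Offer`, `TaskInfo`); the `OfferPacker` name is hypothetical and no driver call is made.

```scala
import scala.collection.mutable

// Simplified stand-in for the Mesos Offer protobuf.
case class Offer(id: String, cpus: Int)

object OfferPacker {
  // For each offer, dequeue one task per available CPU and return the
  // batch to launch with that offer (the list passed to launchTasks).
  def pack(offers: Seq[Offer], queue: mutable.Queue[String]): Map[String, Seq[String]] =
    offers.map { offer =>
      val n = math.min(offer.cpus, queue.size)
      offer.id -> (1 to n).map(_ => queue.dequeue())
    }.toMap
}
```

With two 4-CPU offers and five queued tasks, the first offer takes four tasks and the second takes the remaining one, mirroring the behavior described above.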
>>>>>
>>>>> On a cluster of 3 nodes and 4 cores each (12 total cores), it takes
>>>>> 120s to execute 1000 tasks out of the queue. We are evaluating mesos
>>>>> because we want to use it to replace our current homegrown cluster
>>>>> controller, which can execute 1000 tasks in way less than 120s.
>>>>>
>>>>> I am seeing two things that concern me:
>>>>>
>>>>>    - The time between driver.launchTasks and receiving a callback to
>>>>>      statusUpdate when the task completes is typically 200-500ms, and
>>>>>      sometimes even as high as 1000-2000ms.
>>>>>    - The time between when a task completes and when I get an offer
>>>>>      for the newly freed resource is another 500ms or so.
>>>>>
>>>>> These latencies explain why I can only execute tasks at a rate of
>>>>> about 8/s.
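The numbers above are self-consistent, which a quick back-of-the-envelope check shows: 1000 tasks in 120s across 12 one-CPU slots implies each slot spends about 1.44s per task, matching the observed 0.2-1.0s launch-to-complete latency plus roughly 0.5s before the freed resource is re-offered.

```scala
// Back-of-the-envelope check of the observed rate, using the figures
// quoted in the message above.
object ThroughputEstimate {
  val slots = 12                                   // 3 nodes x 4 cores
  val tasksPerSecond = 1000.0 / 120.0              // ~8.3 tasks/s, i.e. the "about 8/s"
  val perSlotCycleSeconds = slots / tasksPerSecond // ~1.44s: launch-to-complete + re-offer delay
}
```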
>>>>>
>>>>> It looks like my offers always include all 4 cores on each machine,
>>>>> which suggests that mesos doesn't send an offer as soon as a
>>>>> single resource is available, and prefers to wait and send an offer
>>>>> with more resources in it. Is this true?
>>>>>
>>>>> Thanks in advance for any advice you can offer!
>>>>>
>>>>> - Philip
>>>>>