When trying to test a framework in an automated way, I tend to think of
the framework in these parts:
1. Executor
2. Scheduler's interaction with Mesos and state persistence
3. Scheduler's task assignment of resources

I will skip #1; you covered that already, and it depends largely on the
kind of executor being used.

#2 is mostly achieved for us by using a state machine along with a reliable
persistence engine. Then it comes down to testing the state machine, which
can be pretty simple. Besides the obvious state-transition-rule tests, we
only add tests for alert generation/handling when certain state transitions
time out. For example, a task in the STAGING state must transition to the
STARTING state within a certain time, or an alert is generated.
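
To make that concrete, here is a minimal sketch in Python of the kind of
state machine test I mean. TaskStateMachine, its transition table, and the
alert hook are hypothetical stand-ins for whatever your framework actually
uses:

    import unittest

    # Hypothetical state machine: STAGING must reach STARTING within
    # staging_timeout seconds, otherwise the alert hook fires.
    class TaskStateMachine:
        VALID = {("STAGING", "STARTING"), ("STARTING", "RUNNING"),
                 ("RUNNING", "FINISHED"), ("RUNNING", "FAILED")}

        def __init__(self, alert_fn, staging_timeout=60):
            self.state = "STAGING"
            self.alert_fn = alert_fn
            self.staging_timeout = staging_timeout

        def transition(self, new_state):
            if (self.state, new_state) not in self.VALID:
                raise ValueError("illegal transition %s -> %s"
                                 % (self.state, new_state))
            self.state = new_state

        def on_clock(self, seconds_in_state):
            # Called periodically; alerts if STAGING lasts too long.
            if (self.state == "STAGING"
                    and seconds_in_state > self.staging_timeout):
                self.alert_fn("task stuck in STAGING")

    class StateMachineTest(unittest.TestCase):
        def test_illegal_transition_rejected(self):
            sm = TaskStateMachine(alert_fn=lambda msg: None)
            self.assertRaises(ValueError, sm.transition, "FINISHED")

        def test_staging_timeout_alerts(self):
            alerts = []
            sm = TaskStateMachine(alert_fn=alerts.append,
                                  staging_timeout=60)
            sm.on_clock(seconds_in_state=61)  # simulated clock, no sleeps
            self.assertEqual(alerts, ["task stuck in STAGING"])

    if __name__ == "__main__":
        unittest.main()

Driving the timeout with a simulated clock instead of real sleeps is what
keeps these tests deterministic.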

#3 is where we have spent most of our time. This may not be necessary for
simpler assignment strategies such as first fit; we are doing a bit more
for optimal task assignment with hard/soft constraints, auto scaling, etc.
Testing a sophisticated scheduler can be non-trivial, but fortunately it
can be unit tested without requiring the rest of Mesos. Offers can be
mocked/created for testing, including all available resources (we currently
do this for CPU, memory, and ports). Launching tasks on offers can be
mocked by generating new offers with the resources used by the launched
tasks subtracted. As of now, I have as many lines of code in unit tests as
in the actual code. Sometimes it takes less effort to write a new scheduler
feature than to come up with deterministic tests for it, and it would take
far more effort to debug it in real runs if it weren't unit tested.
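
To illustrate the offer mocking, here is a minimal sketch in Python. The
dict-based offer model and the launch() helper are deliberate
simplifications for testing, not the actual Mesos protobuf API:

    # Hypothetical, simplified offer model for scheduler unit tests.
    # Real Mesos offers are protobufs; a dict of scalars plus a port
    # set is enough to exercise assignment logic deterministically.

    def make_offer(offer_id, cpus, mem, ports):
        return {"id": offer_id, "cpus": cpus, "mem": mem,
                "ports": set(ports)}

    def launch(offer, task):
        # Simulate launching a task on an offer: return the remaining
        # resources, as Mesos would re-offer the unused portion.
        assert task["cpus"] <= offer["cpus"]
        assert task["mem"] <= offer["mem"]
        assert task["ports"] <= offer["ports"]
        return {"id": offer["id"] + "-remaining",
                "cpus": offer["cpus"] - task["cpus"],
                "mem": offer["mem"] - task["mem"],
                "ports": offer["ports"] - task["ports"]}

    # Feed mocked offers to the assignment strategy under test.
    offer = make_offer("o1", cpus=4.0, mem=8192, ports=[31000, 31001])
    task = {"cpus": 1.0, "mem": 2048, "ports": {31000}}
    remaining = launch(offer, task)
    assert remaining["cpus"] == 3.0 and 31001 in remaining["ports"]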

About #4 in your list ("4. scheduler is fine in the wild (in presence of
others/failures/checkpointing/...)"), I'd call out ZooKeeper interaction as
well, since most likely there are multiple copies of the scheduler running,
using a leader election strategy for HA purposes.
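
That part can be exercised directly as well. Here is a minimal sketch
using the kazoo Python client; the ZNode path and the callback are my
assumptions, and in unit tests the ZooKeeper client itself can be replaced
with a fake so that failover runs deterministically:

    from kazoo.client import KazooClient

    def run_scheduler_as_leader():
        # Hypothetical entry point: only the elected leader registers
        # with Mesos and starts assigning tasks.
        print("became leader; registering with Mesos...")

    client = KazooClient(hosts="127.0.0.1:2181")
    client.start()
    election = client.Election("/myframework/election", "scheduler-1")
    election.run(run_scheduler_as_leader)  # blocks until elected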

Happy to hear other strategies as well...

Sharma



On Sun, Oct 12, 2014 at 8:44 AM, Dharmesh Kakadia <[email protected]>
wrote:

> Thanks, David.
>
> Taking the state of the framework as input is an interesting design. I am
> assuming the scheduler is maintaining the state and then handing tasks to
> slaves. If that's the case, we can safely test the executor (stateless:
> receiving an event and returning the appropriate status to the scheduler).
> You can construct scheduler tests similarly by passing different states
> and events and observing the next state. This way you can be sure that
> your callbacks work fine in *isolation*. I would be concerned about the
> state of the framework in the case of callback reordering (or
> re-execution) in *all possible scenarios*. Mocking is exactly what might
> uncover such bugs, but, as you pointed out, I also think it would not be
> trivial for many frameworks.
>
> At a high level, it would be important for framework developers to know
> that:
> 1. executors are working fine in isolation on a fresh start, implementing
> the feature.
> 2. executors are working fine when rescheduled/restarted/in presence of
> other executors.
> 3. scheduler is working fine in isolation.
> 4. scheduler is fine in the wild (in presence of
> others/failures/checkpointing/...).
>
> 1 is easy to do traditionally. 2 is possible if your executors do not
> have side effects, or if you are using Docker, etc.
> 3 and 4 are not easy to do. I think having support/a library for testing
> schedulers is something all framework writers would benefit from. Not
> having to think about communication between executors and the scheduler
> is already a big plus; can we also make it easier for developers to test
> their scheduler behaviour?
>
> Thoughts?
>
> I would love to hear thoughts from others.
>
> Thanks,
> Dharmesh
>
> On Sun, Oct 12, 2014 at 8:03 PM, David Greenberg <[email protected]>
> wrote:
>
>> For our frameworks, we don't tend to do much automated testing of the
>> Mesos interface. Instead, we construct the framework state, then "send
>> it a message", since our callbacks take the state of the framework plus
>> the event as arguments. This way, we don't need to have Mesos running,
>> and we can trim away large amounts of code that is necessary to connect
>> to Mesos but unnecessary for the actual feature under test. We've also
>> been experimenting with simulation testing by mocking out the Mesos
>> APIs. These techniques are mostly effective when you can pretend that
>> the executors you're using don't communicate much, or when they're
>> trivial to mock.
>>
>> On Sun, Oct 12, 2014 at 9:42 AM, Dharmesh Kakadia <[email protected]>
>> wrote:
>>
>>> Hi,
>>>
>>> I am working on a tiny experimental framework for Mesos. I was
>>> wondering what the recommended way of writing test cases for a
>>> framework is. I looked at several existing frameworks, but it's still
>>> not clear to me. I understand that I might be able to test executor
>>> functionality in isolation through normal test cases, but testing the
>>> framework as a whole is what I am unclear about.
>>>
>>> Suggestions? Is that a non-goal? How do other framework developers go
>>> about it?
>>>
>>> Also, on a related note, is there a better way to debug frameworks
>>> than sifting through logs?
>>>
>>> Thanks,
>>> Dharmesh
>>>
