Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-23 Thread Guangya Liu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/
---

(Updated 十月 23, 2015, 9:05 a.m.)


Review request for mesos and Vinod Kone.


Changes
---

Address Anand's comments per IRC discussion


Bugs: MESOS-3733
https://issues.apache.org/jira/browse/MESOS-3733


Repository: mesos


Description (updated)
---

Root Cause: The reason is that the DECLINE call set filter as 1hr,
the Clock::advance set as 100m. A race condition is that both DECLINE
and SUPPRESS started up in different threads and the call Clock::advance
may be called before SUPPRESS finished. The clock advanced for 100m which
is greater than 1hr, this caused the allocator start to allocate resource
again and ASSERT_TRUE(event.isPending()) will be failed.

Solution: Make SUPPRESS call is finished before call Clock::advance


Diffs (updated)
-

  src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 

Diff: https://reviews.apache.org/r/39548/diff/


Testing
---

Platform: Ubuntu 14.04
make 
make check
bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
--gtest_repeat=-1 --gtest_break_on_failure


Thanks,

Guangya Liu



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-23 Thread Anand Mazumdar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103745
---

Ship it!


LGTM !

- Anand Mazumdar


On Oct. 23, 2015, 9:05 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated Oct. 23, 2015, 9:05 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Make SUPPRESS call is finished before call Clock::advance
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-23 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103752
---


Patch looks great!

Reviews applied: [39548]

All tests passed.

- Mesos ReviewBot


On Oct. 23, 2015, 9:05 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated Oct. 23, 2015, 9:05 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Make SUPPRESS call is finished before call Clock::advance
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-23 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103810
---

Ship it!


Ship It!

- Vinod Kone


On Oct. 23, 2015, 9:05 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated Oct. 23, 2015, 9:05 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Make SUPPRESS call is finished before call Clock::advance
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-23 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103735
---


Patch looks great!

Reviews applied: [39548]

All tests passed.

- Mesos ReviewBot


On Oct. 23, 2015, 5:29 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated Oct. 23, 2015, 5:29 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Call SUPPRESS first, and make sure SUPPRESS call ready before
> call DECLINE.
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-22 Thread Guangya Liu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/
---

Review request for mesos and Vinod Kone.


Bugs: MESOS-3733
https://issues.apache.org/jira/browse/MESOS-3733


Repository: mesos


Description
---

Root Cause: The reason is that the DECLINE call set filter as 1hr,
the Clock::advance set as 100m. A race condition is that both DECLINE
and SUPPRESS started up in different threads and the call Clock::advance
may be called before SUPPRESS finished. The clock advanced for 100m which
is greater than 1hr, this caused the allocator start to allocate resource
again and ASSERT_TRUE(event.isPending()) will be failed.

Solution: Call SUPPRESS first, this can make sure the SUPPRESS can
be finished before DECLINE.


Diffs
-

  src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 

Diff: https://reviews.apache.org/r/39548/diff/


Testing
---

Platform: Ubuntu 14.04
make 
make check
bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
--gtest_repeat=-1 --gtest_break_on_failure


Thanks,

Guangya Liu



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-22 Thread Vinod Kone

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103660
---



src/tests/scheduler_tests.cpp (line 1001)


Hmm. Does this really fix the race? What's the guarantee that SUPPRESS and 
DECLINE are processed by the master/allocator before you advance the clock?

I think what you want do instead here is setup a FUTURE_DISPATCH() on 
HierarchicalAllocatorProcess::suppressOffers() and do a clock::settle() on that 
future before you advance the clock. 

Does that make sense?


- Vinod Kone


On Oct. 22, 2015, 9:12 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated Oct. 22, 2015, 9:12 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Call SUPPRESS first, this can make sure the SUPPRESS can
> be finished before DECLINE.
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-22 Thread Anand Mazumdar

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103710
---



src/tests/scheduler_tests.cpp (lines 999 - 1008)


Wondering, why did you change order here i.e. move this piece of code up ? 

Now, with using `FUTURE_DISPATCH` you can gurrantee that `Suppress` would 
be completed before the `Clock` is advanced. Am I missing something ?



src/tests/scheduler_tests.cpp (line 1007)


This alone won't suffice. AFAIK, the `Future` returned by `FUTURE_DISPATCH` 
is marked ready when the function is invoked not neccessarily when the function 
has completely executed.

You would want to pause/settle the clock too. The modified code would look 
like:

```
 Future suppressOffers =
FUTURE_DISPATCH(_, ::suppressOffers);

  {
Call call;
call.mutable_framework_id()->CopyFrom(id);
call.set_type(Call::SUPPRESS);

mesos.send(call);
  }

  AWAIT_READY(suppressOffers);
  
  // Wait for suppress to be complete.
  Clock::pause();
  Clock::settle();
  
  // No offers should be sent within 100 mins because the framework
  // suppressed offers.
  Clock::advance(Minutes(100));
  Clock::settle();
```

Here is some more info: 
http://mesos.apache.org/documentation/latest/testing-patterns/ . What do you 
think ?


- Anand Mazumdar


On Oct. 23, 2015, 2:32 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated Oct. 23, 2015, 2:32 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Call SUPPRESS first, and make sure SUPPRESS call ready before
> call DECLINE.
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-22 Thread Guangya Liu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/
---

(Updated Oct. 23, 2015, 2:32 a.m.)


Review request for mesos and Vinod Kone.


Changes
---

Address Vinod's comments


Bugs: MESOS-3733
https://issues.apache.org/jira/browse/MESOS-3733


Repository: mesos


Description (updated)
---

Root Cause: The reason is that the DECLINE call set filter as 1hr,
the Clock::advance set as 100m. A race condition is that both DECLINE
and SUPPRESS started up in different threads and the call Clock::advance
may be called before SUPPRESS finished. The clock advanced for 100m which
is greater than 1hr, this caused the allocator start to allocate resource
again and ASSERT_TRUE(event.isPending()) will be failed.

Solution: Call SUPPRESS first, and make sure SUPPRESS call ready before
call DECLINE.


Diffs (updated)
-

  src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 

Diff: https://reviews.apache.org/r/39548/diff/


Testing
---

Platform: Ubuntu 14.04
make 
make check
bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
--gtest_repeat=-1 --gtest_break_on_failure


Thanks,

Guangya Liu



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-22 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103711
---


Patch looks great!

Reviews applied: [39548]

All tests passed.

- Mesos ReviewBot


On Oct. 23, 2015, 2:32 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated Oct. 23, 2015, 2:32 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Call SUPPRESS first, and make sure SUPPRESS call ready before
> call DECLINE.
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-22 Thread Guangya Liu


> On 十月 23, 2015, 3:53 a.m., Anand Mazumdar wrote:
> > src/tests/scheduler_tests.cpp, lines 999-1008
> > 
> >
> > Wondering, why did you change order here i.e. move this piece of code 
> > up ? 
> > 
> > Now, with using `FUTURE_DISPATCH` you can gurrantee that `Suppress` 
> > would be completed before the `Clock` is advanced. Am I missing something ?

I think that the order of SUPPRESS and DECLINEN is not that important as long 
as we can make sure the offer will be send after decline filter time out. So 
here it is OK to adjust and keep the order.


- Guangya


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/#review103710
---


On 十月 23, 2015, 2:32 a.m., Guangya Liu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/39548/
> ---
> 
> (Updated 十月 23, 2015, 2:32 a.m.)
> 
> 
> Review request for mesos and Vinod Kone.
> 
> 
> Bugs: MESOS-3733
> https://issues.apache.org/jira/browse/MESOS-3733
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Root Cause: The reason is that the DECLINE call set filter as 1hr,
> the Clock::advance set as 100m. A race condition is that both DECLINE
> and SUPPRESS started up in different threads and the call Clock::advance
> may be called before SUPPRESS finished. The clock advanced for 100m which
> is greater than 1hr, this caused the allocator start to allocate resource
> again and ASSERT_TRUE(event.isPending()) will be failed.
> 
> Solution: Call SUPPRESS first, and make sure SUPPRESS call ready before
> call DECLINE.
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 
> 
> Diff: https://reviews.apache.org/r/39548/diff/
> 
> 
> Testing
> ---
> 
> Platform: Ubuntu 14.04
> make 
> make check
> bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
> --gtest_repeat=-1 --gtest_break_on_failure
> 
> 
> Thanks,
> 
> Guangya Liu
> 
>



Re: Review Request 39548: SchedulerTest.Suppress is flaky

2015-10-22 Thread Guangya Liu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/39548/
---

(Updated 十月 23, 2015, 5:29 a.m.)


Review request for mesos and Vinod Kone.


Changes
---

Address Anand's comments


Bugs: MESOS-3733
https://issues.apache.org/jira/browse/MESOS-3733


Repository: mesos


Description
---

Root Cause: The reason is that the DECLINE call set filter as 1hr,
the Clock::advance set as 100m. A race condition is that both DECLINE
and SUPPRESS started up in different threads and the call Clock::advance
may be called before SUPPRESS finished. The clock advanced for 100m which
is greater than 1hr, this caused the allocator start to allocate resource
again and ASSERT_TRUE(event.isPending()) will be failed.

Solution: Call SUPPRESS first, and make sure SUPPRESS call ready before
call DECLINE.


Diffs (updated)
-

  src/tests/scheduler_tests.cpp 7946cb48d62f4ed6d0fdbc771746518e31921f97 

Diff: https://reviews.apache.org/r/39548/diff/


Testing
---

Platform: Ubuntu 14.04
make 
make check
bin/mesos-tests.sh --gtest_filter="ContentType/SchedulerTest.Suppress/*"  
--gtest_repeat=-1 --gtest_break_on_failure


Thanks,

Guangya Liu