Re: Review Request 67444: Persisted role consumed quota info in the allocator.

2018-07-17 Thread Meng Zhu


> On July 5, 2018, 2:22 p.m., Benjamin Mahler wrote:
> > src/master/allocator/mesos/hierarchical.cpp
> > Lines 2586-2588 (patched)
> > 
> >
> > It's pretty hard to reason from here about whether this is correct. For 
> > example, how do we know that these quantities were not already tracked 
> > because they were allocated prior to becoming reserved? If that invariant 
> > doesn't hold we'll double count?

Thanks for catching this!

The invariant should be:
(1) when tracking reservation, only track unallocated reservations
(2) when track allocation, only track unreserved allocations

(2) is already being done. (1) is hard to do given the current abstraction, we 
will need more than raw "reservations". Will need more refactoring on the Slave 
struct in the allocator to make this work and readable. Will come back to this 
later.

In addition, this reminds me that we do not have tests for the above scenario 
yet. Will add some.


- Meng


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67444/#review205762
---


On June 28, 2018, 1:55 p.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67444/
> ---
> 
> (Updated June 28, 2018, 1:55 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-8802
> https://issues.apache.org/jira/browse/MESOS-8802
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The allocator needs to keep track of role consumed quota
> to maintain quota headroom. Currently this info is
> constructed on the fly prior to each allocation cycle.
> 
> This patch lets the allocator buildup and persist this info
> across allocation cycles to improve performance and reduce
> code complexity.
> 
> 
> Diffs
> -
> 
>   src/master/allocator/mesos/hierarchical.hpp 
> 0f6c0e96a105c64465d3f5db4ff663d8fdfe7e26 
>   src/master/allocator/mesos/hierarchical.cpp 
> cbdfb2ba9c25755ac631557e0e7dbd721f861a4d 
> 
> 
> Diff: https://reviews.apache.org/r/67444/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Re: Review Request 67444: Persisted role consumed quota info in the allocator.

2018-07-05 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67444/#review205762
---



Looks like a great cleanup!

Persist tends to carry the connotation of writing something to durable storage. 
How about:

```
Made quota consumption tracking event-driven in the allocator.

The allocator needs to keep track of role consumed quota
to maintain quota headroom. Currently this info is
constructed on the fly prior to each allocation cycle.

This patch lets the allocator track quota consumption
across allocation cycles in an event-driven manner to improve
performance and reduce code complexity.
```


src/master/allocator/mesos/hierarchical.hpp
Lines 509-512 (patched)


```
  // To enforce quota, we keep track of how much quota is consumed by each 
role.
  // Quota consumption always includes the role's reservations (since they 
cannot
  // be allocated to other roles) as well as any allocated resources for 
the role.
  //
  // This is only tracked for roles with non-default quota.
```



src/master/allocator/mesos/hierarchical.hpp
Lines 526 (patched)


How about `consumedQuotaScalarQuantities`? We don't call it `rolesQuota` 
above for example.



src/master/allocator/mesos/hierarchical.cpp
Lines 1387-1390 (patched)


I don't think we need to repeat ourselves with the "in other words" part. I 
think we just need to explain why reservations are included in one place (in 
the header):

```
  // Track quota consumption.
  //
  // A role's quota consumption includes its allocation as well as any
  // unallocated reservations:
  //
  //   Consumed Quota = reservations + unreserved allocation
  //  = reservations + allocation - allocated reservations
```



src/master/allocator/mesos/hierarchical.cpp
Lines 1404 (patched)


It wasn't clear to me when reading this why it needs to be done on each 
agent, so we should probably clarify that?

```
// Lastly, subtract allocated reservations. This needs to be done on a 
per-agent basis because ...
```



src/master/allocator/mesos/hierarchical.cpp
Lines 1405-1406 (patched)


We probably want a TODO to optimize this by avoiding the copy by returning 
a const reference to the map?



src/master/allocator/mesos/hierarchical.cpp
Lines 1385-1390 (original), 1413-1418 (patched)


Just an unrelated observation, this note looks misleading, since the master 
does try to rescind offers to re-balance.



src/master/allocator/mesos/hierarchical.cpp
Lines 1439-1440 (patched)


It looks like it follows the same order as operations in setQuota, so I 
would expect this to be done after metrics.removeQuota. Any reason not to?

Actually, it seems like the metric should get added after quota tracking 
and removed before quota tracking is removed, since I would imagine the metrics 
look at the tracking information. It looks like we currently only expose 
'allocated' instead of 'consumed' for now, but we probably want to expose 
'consumed' soon, is there a ticket?



src/master/allocator/mesos/hierarchical.cpp
Lines 2613-2614 (original), 2576-2577 (patched)


Hm.. an aside, but I was puzzled about this function. Since it takes 
role->reservations it loses information for non-scalar resources.

Fortunately, this function only uses the scalar quantities anyway, but we 
should make the interface clearly take quantities to clarify the assumption.



src/master/allocator/mesos/hierarchical.cpp
Lines 2586-2588 (patched)


It's pretty hard to reason from here about whether this is correct. For 
example, how do we know that these quantities were not already tracked because 
they were allocated prior to becoming reserved? If that invariant doesn't hold 
we'll double count?



src/master/allocator/mesos/hierarchical.cpp
Lines 2607-2611 (patched)


Ditto here, it's hard to reason about why we can remove it here, what if 
the resources are unreserved but remain allocated?



src/master/allocator/mesos/hierarchical.cpp
Lines 2727-2728 (patched)


Hm.. there seem to be some invariants here in how the tracking functions 
are called but I can't quite figure them out. Is there a way to enforce them?

Let's say I have allocated 

Re: Review Request 67444: Persisted role consumed quota info in the allocator.

2018-06-28 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67444/#review205541
---



Patch looks great!

Reviews applied: [67773, 67444]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose --disable-libtool-wrappers' ENVIRONMENT='GLOG_v=1 
MESOS_VERBOSE=1'; ./support/docker-build.sh

- Mesos Reviewbot


On June 28, 2018, 8:55 p.m., Meng Zhu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67444/
> ---
> 
> (Updated June 28, 2018, 8:55 p.m.)
> 
> 
> Review request for mesos and Benjamin Mahler.
> 
> 
> Bugs: MESOS-8802
> https://issues.apache.org/jira/browse/MESOS-8802
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The allocator needs to keep track of role consumed quota
> to maintain quota headroom. Currently this info is
> constructed on the fly prior to each allocation cycle.
> 
> This patch lets the allocator buildup and persist this info
> across allocation cycles to improve performance and reduce
> code complexity.
> 
> 
> Diffs
> -
> 
>   src/master/allocator/mesos/hierarchical.hpp 
> 0f6c0e96a105c64465d3f5db4ff663d8fdfe7e26 
>   src/master/allocator/mesos/hierarchical.cpp 
> cbdfb2ba9c25755ac631557e0e7dbd721f861a4d 
> 
> 
> Diff: https://reviews.apache.org/r/67444/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Meng Zhu
> 
>



Re: Review Request 67444: Persisted role consumed quota info in the allocator.

2018-06-28 Thread Meng Zhu


> On June 28, 2018, 3:25 p.m., Mesos Reviewbot Windows wrote:
> > FAIL: Some of the unit tests failed. Please check the relevant logs.
> > 
> > Reviews applied: `['67773', '67444']`
> > 
> > Failed command: `Start-MesosCITesting`
> > 
> > All the build artifacts available at: 
> > http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67444
> > 
> > Relevant logs:
> > 
> > - 
> > [mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67444/logs/mesos-tests-stdout.log):
> > 
> > ```
> > [   OK ] Endpoint/SlaveEndpointTest.NoAuthorizer/2 (127 ms)
> > [--] 9 tests from Endpoint/SlaveEndpointTest (1029 ms total)
> > 
> > [--] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
> > [ RUN  ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
> > [   OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 
> > (38 ms)
> > [ RUN  ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
> > [   OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 
> > (38 ms)
> > [--] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (77 
> > ms total)
> > 
> > [--] 1 test from IsolationFlag/CpuIsolatorTest
> > [ RUN  ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
> > [   OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (848 ms)
> > [--] 1 test from IsolationFlag/CpuIsolatorTest (870 ms total)
> > 
> > [--] 1 test from IsolationFlag/MemoryIsolatorTest
> > [ RUN  ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
> > [   OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (838 ms)
> > [--] 1 test from IsolationFlag/MemoryIsolatorTest (859 ms total)
> > 
> > [--] Global test environment tear-down
> > [==] 990 tests from 98 test cases ran. (460605 ms total)
> > [  PASSED  ] 989 tests.
> > [  FAILED  ] 1 test, listed below:
> > [  FAILED  ] MesosContainerizer/DefaultExecutorTest.SigkillExecutor/0, 
> > where GetParam() = "mesos"
> > 
> >  1 FAILED TEST
> >   YOU HAVE 220 DISABLED TESTS
> > 
> > ```
> > 
> > - 
> > [mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67444/logs/mesos-tests-stderr.log):
> > 
> > ```
> > I0628 22:25:13.862025 19180 master.cpp:10900] Updating the state of task 
> > c4b533f8-2e56-4c92-bcfe-6d053a2640a7 of framework 
> > 831789c0-9734-4768-8415-c26da2d176e3- (latest state: TASK_KILLED, 
> > status update state: TASK_KILLED)
> > I0628 22:25:13.862025 20224 slave.cpp:3939] Shutting down framework 
> > 831789c0-9734-4768-8415-c26da2d176e3-
> > I0628 22:25:13.862025 20224 slave.cpp:6660] Shutting down executor 
> > 'c4b533f8-2e56-4c92-bcfe-6d053a2640a7' of framework 
> > 831789c0-9734-4768-8415-c26da2d176e3- at executor(1)@192.10.1.6:59686
> > I0628 22:25:13.864045 20224 slave.cpp:931] Agent terminating
> > W0628 22:25:13.I0628 22:25:13.658028 18908 exec.cpp:162] Version: 1.7.0
> > I0628 22:25:13.683043 20960 exec.cpp:236] Executor registered on agent 
> > 831789c0-9734-4768-8415-c26da2d176e3-S0
> > I0628 22:25:13.687032 15960 executor.cpp:182] Received SUBSCRIBED event
> > I0628 22:25:13.691023 15960 executor.cpp:186] Subscribed executor on 
> > windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net
> > I0628 22:25:13.691023 15960 executor.cpp:182] Received LAUNCH event
> > I0628 22:25:13.696027 15960 executor.cpp:679] Starting task 
> > c4b533f8-2e56-4c92-bcfe-6d053a2640a7
> > I0628 22:25:13.777024 15960 executor.cpp:499] Running 
> > 'D:\DCOS\mesos\src\mesos-containerizer.exe launch '
> > I0628 22:25:13.833027 15960 executor.cpp:693] Forked command at 20420
> > I0628 22:25:13.864045 18700 exec.cpp:445] Executor asked to shutdown
> > I0628 22:25:13.865029 15960 executor.cpp:182] Received SHUTDOWN event
> > I0628 22:25:13.865029 15960 executor.cpp:796] Shutting down
> > I0628 22:25:13.865029 15960 executor.cpp:909] Sending SIGTERM to process 
> > tree at pid 20864045 20224 slave.cpp:3935] Ignoring shutdown framework 
> > 831789c0-9734-4768-8415-c26da2d176e3- because it is terminating
> > I0628 22:25:13.864045 19180 master.cpp:10999] Removing task 
> > c4b533f8-2e56-4c92-bcfe-6d053a2640a7 with resources cpus(allocated: *):4; 
> > mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: 
> > *):[31000-32000] of framework 831789c0-9734-4768-8415-c26da2d176e3- on 
> > agent 831789c0-9734-4768-8415-c26da2d176e3-S0 at 
> > slave(449)@192.10.1.6:59665 
> > (windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net)
> > I0628 22:25:13.868031 19180 master.cpp:1330] Agent 
> > 831789c0-9734-4768-8415-c26da2d176e3-S0 at slave(449)@192.10.1.6:59665 
> > (windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net) 
> > disconnected
> > I0628 22:25:13.868031 19180 master.cpp:3340] Disconnecting agent 
> > 831789c0-9734-4768-8415-c26da2d176e3-S0 at slave(449)@192.10.1.6:59665 
> > 

Re: Review Request 67444: Persisted role consumed quota info in the allocator.

2018-06-28 Thread Mesos Reviewbot Windows

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67444/#review205537
---



FAIL: Some of the unit tests failed. Please check the relevant logs.

Reviews applied: `['67773', '67444']`

Failed command: `Start-MesosCITesting`

All the build artifacts available at: 
http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67444

Relevant logs:

- 
[mesos-tests-stdout.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67444/logs/mesos-tests-stdout.log):

```
[   OK ] Endpoint/SlaveEndpointTest.NoAuthorizer/2 (127 ms)
[--] 9 tests from Endpoint/SlaveEndpointTest (1029 ms total)

[--] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest
[ RUN  ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0
[   OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/0 (38 
ms)
[ RUN  ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1
[   OK ] ContainerizerType/DefaultContainerDNSFlagTest.ValidateFlag/1 (38 
ms)
[--] 2 tests from ContainerizerType/DefaultContainerDNSFlagTest (77 ms 
total)

[--] 1 test from IsolationFlag/CpuIsolatorTest
[ RUN  ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0
[   OK ] IsolationFlag/CpuIsolatorTest.ROOT_UserCpuUsage/0 (848 ms)
[--] 1 test from IsolationFlag/CpuIsolatorTest (870 ms total)

[--] 1 test from IsolationFlag/MemoryIsolatorTest
[ RUN  ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0
[   OK ] IsolationFlag/MemoryIsolatorTest.ROOT_MemUsage/0 (838 ms)
[--] 1 test from IsolationFlag/MemoryIsolatorTest (859 ms total)

[--] Global test environment tear-down
[==] 990 tests from 98 test cases ran. (460605 ms total)
[  PASSED  ] 989 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] MesosContainerizer/DefaultExecutorTest.SigkillExecutor/0, where 
GetParam() = "mesos"

 1 FAILED TEST
  YOU HAVE 220 DISABLED TESTS

```

- 
[mesos-tests-stderr.log](http://dcos-win.westus.cloudapp.azure.com/mesos-build/review/67444/logs/mesos-tests-stderr.log):

```
I0628 22:25:13.862025 19180 master.cpp:10900] Updating the state of task 
c4b533f8-2e56-4c92-bcfe-6d053a2640a7 of framework 
831789c0-9734-4768-8415-c26da2d176e3- (latest state: TASK_KILLED, status 
update state: TASK_KILLED)
I0628 22:25:13.862025 20224 slave.cpp:3939] Shutting down framework 
831789c0-9734-4768-8415-c26da2d176e3-
I0628 22:25:13.862025 20224 slave.cpp:6660] Shutting down executor 
'c4b533f8-2e56-4c92-bcfe-6d053a2640a7' of framework 
831789c0-9734-4768-8415-c26da2d176e3- at executor(1)@192.10.1.6:59686
I0628 22:25:13.864045 20224 slave.cpp:931] Agent terminating
W0628 22:25:13.I0628 22:25:13.658028 18908 exec.cpp:162] Version: 1.7.0
I0628 22:25:13.683043 20960 exec.cpp:236] Executor registered on agent 
831789c0-9734-4768-8415-c26da2d176e3-S0
I0628 22:25:13.687032 15960 executor.cpp:182] Received SUBSCRIBED event
I0628 22:25:13.691023 15960 executor.cpp:186] Subscribed executor on 
windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net
I0628 22:25:13.691023 15960 executor.cpp:182] Received LAUNCH event
I0628 22:25:13.696027 15960 executor.cpp:679] Starting task 
c4b533f8-2e56-4c92-bcfe-6d053a2640a7
I0628 22:25:13.777024 15960 executor.cpp:499] Running 
'D:\DCOS\mesos\src\mesos-containerizer.exe launch '
I0628 22:25:13.833027 15960 executor.cpp:693] Forked command at 20420
I0628 22:25:13.864045 18700 exec.cpp:445] Executor asked to shutdown
I0628 22:25:13.865029 15960 executor.cpp:182] Received SHUTDOWN event
I0628 22:25:13.865029 15960 executor.cpp:796] Shutting down
I0628 22:25:13.865029 15960 executor.cpp:909] Sending SIGTERM to process tree 
at pid 20864045 20224 slave.cpp:3935] Ignoring shutdown framework 
831789c0-9734-4768-8415-c26da2d176e3- because it is terminating
I0628 22:25:13.864045 19180 master.cpp:10999] Removing task 
c4b533f8-2e56-4c92-bcfe-6d053a2640a7 with resources cpus(allocated: *):4; 
mem(allocated: *):2048; disk(allocated: *):1024; ports(allocated: 
*):[31000-32000] of framework 831789c0-9734-4768-8415-c26da2d176e3- on 
agent 831789c0-9734-4768-8415-c26da2d176e3-S0 at slave(449)@192.10.1.6:59665 
(windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net)
I0628 22:25:13.868031 19180 master.cpp:1330] Agent 
831789c0-9734-4768-8415-c26da2d176e3-S0 at slave(449)@192.10.1.6:59665 
(windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net) disconnected
I0628 22:25:13.868031 19180 master.cpp:3340] Disconnecting agent 
831789c0-9734-4768-8415-c26da2d176e3-S0 at slave(449)@192.10.1.6:59665 
(windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net)
I0628 22:25:13.868031 19180 master.cpp:3359] Deactivating agent 
831789c0-9734-4768-8415-c26da2d176e3-S0 at slave(449)@192.10.1.6:59665 
(windows-02.enofukwu14ruplxn0gs3yzmsgf.xx.internal.cloudapp.net)
I0628