Re: Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.

2018-09-13 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67022/
---

(Updated Sept. 13, 2018, 10:17 a.m.)


Review request for mesos, Eric Chung and Jason Lai.


Changes
---

Reset requiredMasterCapabilities.agentUpdate after successful refresh.


Repository: mesos


Description
---

Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.


Diffs (updated)
-

  src/slave/flags.cpp fd53d90776967ae97575140778129d6fddd726d2 
  src/slave/slave.cpp e6c7e686f287fb4448a0074d4e99298665fc866d 


Diff: https://reviews.apache.org/r/67022/diff/3/

Changes: https://reviews.apache.org/r/67022/diff/2-3/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.

2018-09-11 Thread Zhitao Li


> On Sept. 11, 2018, 10:35 a.m., Benno Evers wrote:
> > src/slave/slave.cpp
> > Lines 1607 (patched)
> > <https://reviews.apache.org/r/67022/diff/1/?file=2018278#file2018278line1607>
> >
> > If I understand the slave code correctly, I think this would checkpoint 
> > the new state every time we're losing connection to a master (e.g. when the 
> > leader changes).
> > 
> > I'd suggest moving it into `Slave::recover()`, i.e. the place where we 
> > set `requiredMasterCapabilities.agentUpdate` to true in the first place.

The condition to set `requiredMasterCapabilities.agentUpdate` is:

```
 // Check for SlaveInfo compatibility.  

  
 Try _compatible = 

  
   compatible(slaveState->info.get(), info);

  


  
 if (_compatible.isSome()) {

  
   // Permitted change, so we reuse the recovered agent id and reconnect

  
   // to running executors. 

  


  
   // Prior to Mesos 1.5, the master expected that an agent would never 

  
   // change its `SlaveInfo` and keep the same slave id, and therefore 
would   
   
   // not update it's internal data structures on agent re-registration.

  
   if (!(slaveState->info.get() == info)) { 

  
 requiredMasterCapabilities.agentUpdate = true; 

  
   }   
```

So if slave info did not change (i.e, a simple leader change), 
`requiredMasterCapabilities.agentUpdate` should be `false` and we would not 
update the checkpointed `SlaveInfo`.

I confirmed that in our agent log, but we can also create a new unit test to 
cover that case. wdyt?


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67022/#review208520
-------


On May 8, 2018, 5:40 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67022/
> ---
> 
> (Updated May 8, 2018, 5:40 p.m.)
> 
> 
> Review request for mesos, Eric Chung and Jason Lai.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.
> 
> 
> Diffs
> -
> 
>   src/slave/flags.cpp a319b5ea633c41fd8a252c5e1617ac52d1480ba5 
>   src/slave/slave.cpp c6d9152d9de4184f9107bb8242b41d468d76e018 
> 
> 
> Diff: https://reviews.apache.org/r/67022/diff/2/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Review Request 68300: Added `gpus` to failure message.

2018-08-10 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68300/
---

Review request for mesos, Gilbert Song and Jason Lai.


Repository: mesos


Description
---

This message would be propagated all the way to scheduler, so making it
explicit that this is related to gpu can speed up debugging.


Diffs
-

  src/slave/containerizer/mesos/isolators/gpu/allocator.cpp 
c288ad634b856702483b9751f41445308babd0c9 


Diff: https://reviews.apache.org/r/68300/diff/1/


Testing
---


Thanks,

Zhitao Li



Review Request 68299: Documented new `--cgroups_destroy_timeout` agent option.

2018-08-10 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68299/
---

Review request for mesos, Gilbert Song and Jason Lai.


Repository: mesos


Description
---

Documented new `--cgroups_destroy_timeout` agent option.


Diffs
-

  docs/configuration/agent.md 83b5fed5a8bf287700688507eaa584f37e8ba2b7 


Diff: https://reviews.apache.org/r/68299/diff/1/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 68088: Replaced `cgroups::DESTROY_TIMEOUT` with new agent flag.

2018-08-10 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68088/
---

(Updated Aug. 10, 2018, 10:36 a.m.)


Review request for mesos, Gilbert Song, Jason Lai, and James Peach.


Bugs: MESOS-8038
https://issues.apache.org/jira/browse/MESOS-8038


Repository: mesos


Description (updated)
---

The new agent flag can be used to reconfigure how long a container
destroy is allowed to take on Mesos containerizer.


Diffs (updated)
-

  src/linux/cgroups.hpp 6a6f0e6df208bc0b0a888d132b3befd062755851 
  src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp 
e016d6dcff15bf49b003c9f3ca94477ce313f7b6 
  src/slave/containerizer/mesos/linux_launcher.cpp 
cd677cc652dc53ea3bf9a715b415c3e5c48c1f89 
  src/slave/flags.hpp eeb9708f9ec76d83b6719541f4a012544c7c0cbe 
  src/slave/flags.cpp 58cdc0f1100fe244e5bf1036e1ccf39478d5d478 


Diff: https://reviews.apache.org/r/68088/diff/3/

Changes: https://reviews.apache.org/r/68088/diff/2-3/


Testing
---

`make` and `./bin/mesos-slave.sh --help`


Thanks,

Zhitao Li



Re: Review Request 68088: Replaced `cgroups::DESTROY_TIMEOUT` with new agent flag.

2018-07-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68088/
---

(Updated July 30, 2018, 10:50 a.m.)


Review request for mesos, Gilbert Song and Jason Lai.


Changes
---

Squash both diffs.


Summary (updated)
-

Replaced `cgroups::DESTROY_TIMEOUT` with new agent flag.


Bugs: MESOS-8038
https://issues.apache.org/jira/browse/MESOS-8038


Repository: mesos


Description (updated)
---

The new agent flag can be used to reconfigure how long a container
destroy is allowed to take on Mesos containerizer.

The default is also increased to 5 min based on suggestion from Gilbert
because certain containers could have deep system calls which may not
finish within previous 1 min timeout.


Diffs (updated)
-

  src/linux/cgroups.hpp 6a6f0e6df208bc0b0a888d132b3befd062755851 
  src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp 
8fba6f9d335cc43a7cd0fcb51ed047ba0e7027ed 
  src/slave/containerizer/mesos/linux_launcher.cpp 
3bddcece7028745cec6623ac33dbfcaced629629 
  src/slave/flags.hpp eeb9708f9ec76d83b6719541f4a012544c7c0cbe 
  src/slave/flags.cpp 58cdc0f1100fe244e5bf1036e1ccf39478d5d478 


Diff: https://reviews.apache.org/r/68088/diff/2/

Changes: https://reviews.apache.org/r/68088/diff/1-2/


Testing
---

`make` and `./bin/mesos-slave.sh --help`


Thanks,

Zhitao Li



Review Request 68088: Replaced `cgroups::DESTROY_TIMEOUT` with agent flag.

2018-07-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68088/
---

Review request for mesos, Gilbert Song and Jason Lai.


Bugs: MESOS-8038
https://issues.apache.org/jira/browse/MESOS-8038


Repository: mesos


Description
---

Replaced `cgroups::DESTROY_TIMEOUT` with agent flag.


Diffs
-

  src/linux/cgroups.hpp 6a6f0e6df208bc0b0a888d132b3befd062755851 
  src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp 
8fba6f9d335cc43a7cd0fcb51ed047ba0e7027ed 
  src/slave/containerizer/mesos/linux_launcher.cpp 
3bddcece7028745cec6623ac33dbfcaced629629 


Diff: https://reviews.apache.org/r/68088/diff/1/


Testing
---

`make` and `./bin/mesos-slave.sh --help`


Thanks,

Zhitao Li



Review Request 68087: Added new agent flag `cgroups_destroy_timeout`.

2018-07-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68087/
---

Review request for mesos, Gilbert Song and Jason Lai.


Bugs: MESOS-8038
https://issues.apache.org/jira/browse/MESOS-8038


Repository: mesos


Description
---

Added new agent flag `cgroups_destroy_timeout`.


Diffs
-

  src/slave/flags.hpp eeb9708f9ec76d83b6719541f4a012544c7c0cbe 
  src/slave/flags.cpp 58cdc0f1100fe244e5bf1036e1ccf39478d5d478 


Diff: https://reviews.apache.org/r/68087/diff/1/


Testing
---

`make` and run `./bin/mesos-slave.sh --help`.


Thanks,

Zhitao Li



Re: Review Request 67822: Avoid duplicate unmount dangling mount point.

2018-07-03 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67822/
---

(Updated July 3, 2018, 4:05 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Minor style fix


Bugs: MESOS-9049
https://issues.apache.org/jira/browse/MESOS-9049


Repository: mesos


Description
---

We could potentially schedule the framework dir, executor dir, and
executor run sandbox to gc at the same time, and then these
paths will be gc independently, although they are parents and
children directories. This patch makes sure we do not call unmount
anymore after a success.


Diffs (updated)
-

  src/slave/gc.cpp 407f6b23f87cf2e2bdaf873c8adcda57f5d559b3 


Diff: https://reviews.apache.org/r/67822/diff/2/

Changes: https://reviews.apache.org/r/67822/diff/1-2/


Testing
---

```make```


Thanks,

Zhitao Li



Review Request 67822: Avoid duplicate unmount dangling mount point.

2018-07-03 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67822/
---

Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

We could potentially schedule the framework dir, executor dir, and
executor run sandbox to gc at the same time, and then these
paths will be gc independently, although they are parents and
children directories. This patch makes sure we do not call unmount
anymore after a success.


Diffs
-

  src/slave/gc.cpp 407f6b23f87cf2e2bdaf873c8adcda57f5d559b3 


Diff: https://reviews.apache.org/r/67822/diff/1/


Testing
---

```make```


Thanks,

Zhitao Li



Re: Review Request 53105: Added an hourly timer for `containerizer/docker/image_pull`.

2018-06-20 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53105/
---

(Updated June 20, 2018, 9:49 p.m.)


Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang.


Changes
---

Qian's comment.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description
---

Added an hourly timer for `containerizer/docker/image_pull`.


Diffs (updated)
-

  src/slave/containerizer/docker.hpp 1ed47c85f765da6fb91a2a243fcc5030e9686190 
  src/slave/containerizer/docker.cpp f83036aa5a212d1d88e6f5aa41023abc125f7116 


Diff: https://reviews.apache.org/r/53105/diff/7/

Changes: https://reviews.apache.org/r/53105/diff/6-7/


Testing
---

Ran `mesos-execute` with a docker image and docker containerizer twice, 
observed the following metrics:
```
curl -s localhost:5051/metrics/snapshot | jq .
  "containerizer/docker/image_pull_ms": 6628.833024,
  "containerizer/docker/image_pull_ms/count": 2,
  "containerizer/docker/image_pull_ms/max": 8851.497216,
  "containerizer/docker/image_pull_ms/min": 6628.833024,
  "containerizer/docker/image_pull_ms/p50": 7740.16512,
  "containerizer/docker/image_pull_ms/p90": 8629.2307968,
  "containerizer/docker/image_pull_ms/p95": 8740.3640064,
  "containerizer/docker/image_pull_ms/p99": 8829.27057408,
  "containerizer/docker/image_pull_ms/p999": 8849.274551808,
  "containerizer/docker/image_pull_ms/p": 8851.2749495808,
```


Thanks,

Zhitao Li



Review Request 67677: Added doc for new pull latency metrics.

2018-06-20 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67677/
---

Review request for mesos, Jason Lai and Qian Zhang.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description
---

Added doc for new pull latency metrics.


Diffs
-

  docs/monitoring.md 2985f68f644e84cdc5337c516ee06c95abd6018b 


Diff: https://reviews.apache.org/r/67677/diff/1/


Testing
---

https://github.com/zhitaoli/mesos/blob/public/zhitao/time_containerizer_pulls/docs/monitoring.md#containerizers


Thanks,

Zhitao Li



Re: Review Request 66875: Added an hourly timer for docker store pull latency.

2018-06-20 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66875/
---

(Updated June 20, 2018, 11:26 a.m.)


Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang.


Changes
---

Qian's comments.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description
---

This patch added pull latency tracking for docker store, which can tell
us both latency distribution of pull as well as number of pulls.


Diffs (updated)
-

  src/slave/containerizer/mesos/provisioner/docker/store.cpp 
6e7dc44321bff3198e5ffe69be1ba329be3ee31e 


Diff: https://reviews.apache.org/r/66875/diff/3/

Changes: https://reviews.apache.org/r/66875/diff/2-3/


Testing (updated)
---

Ran agent in command line and trigger several launches through `mesos-execute`, 
observed following metrics from agent endpoint:

```
  "containerizer/mesos/provisioner/docker_store/image_pull_ms": 3619.53408,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/count": 2,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/max": 4208.662016,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/min": 3619.53408,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/p50": 3914.098048,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/p90": 
4149.7492224,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/p95": 
4179.2056192,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/p99": 
4202.77073664,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/p999": 
4208.072888064,
  "containerizer/mesos/provisioner/docker_store/image_pull_ms/p": 
4208.6031032064,
```


Thanks,

Zhitao Li



Re: Review Request 53105: Added an hourly timer for `containerizer/docker/image_pull`.

2018-06-20 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53105/
---

(Updated June 20, 2018, 11:25 a.m.)


Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang.


Changes
---

Qian's comments.


Summary (updated)
-

Added an hourly timer for `containerizer/docker/image_pull`.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description (updated)
---

Added an hourly timer for `containerizer/docker/image_pull`.


Diffs (updated)
-

  src/slave/containerizer/docker.hpp 1ed47c85f765da6fb91a2a243fcc5030e9686190 
  src/slave/containerizer/docker.cpp f83036aa5a212d1d88e6f5aa41023abc125f7116 


Diff: https://reviews.apache.org/r/53105/diff/6/

Changes: https://reviews.apache.org/r/53105/diff/5-6/


Testing (updated)
---

Ran `mesos-execute` with a docker image and docker containerizer twice, 
observed the following metrics:
```
curl -s localhost:5051/metrics/snapshot | jq .
  "containerizer/docker/image_pull_ms": 6628.833024,
  "containerizer/docker/image_pull_ms/count": 2,
  "containerizer/docker/image_pull_ms/max": 8851.497216,
  "containerizer/docker/image_pull_ms/min": 6628.833024,
  "containerizer/docker/image_pull_ms/p50": 7740.16512,
  "containerizer/docker/image_pull_ms/p90": 8629.2307968,
  "containerizer/docker/image_pull_ms/p95": 8740.3640064,
  "containerizer/docker/image_pull_ms/p99": 8829.27057408,
  "containerizer/docker/image_pull_ms/p999": 8849.274551808,
  "containerizer/docker/image_pull_ms/p": 8851.2749495808,
```


Thanks,

Zhitao Li



Re: Review Request 53105: Added an hourly timer for `slave/docker_containerizer/pull`.

2018-06-20 Thread Zhitao Li


> On June 20, 2018, 12:58 a.m., Qian Zhang wrote:
> > src/slave/containerizer/docker.hpp
> > Lines 140 (patched)
> > <https://reviews.apache.org/r/53105/diff/5/?file=2014848#file2014848line140>
> >
> > It seems the convention in Mesos is to implement a `struct Metrics` for 
> > all the metrics rather than directly adding the metric here. And the name 
> > of the metric should be in underscore_case instead of camelCase, so I would 
> > suggest to name it `image_pull("containerizer/docker/image_pull", 
> > Hours(1))`.
> > 
> > And is 1 hour too short for this metric?

Ack for name and metrics struct

On window: In our operational experience, if something goes wrong on the docker 
registry side, we typicaly detect that within an hour or so, so I would not 
make this much longer and lose sensitivity. If anyone needs a longer view, they 
can usually pipe the metric into another time series database and perform out 
of bound aggregation from there.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53105/#review205049
-------


On May 21, 2018, 10:16 a.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/53105/
> ---
> 
> (Updated May 21, 2018, 10:16 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, Jason Lai, Jie Yu, and Qian Zhang.
> 
> 
> Bugs: MESOS-6451
> https://issues.apache.org/jira/browse/MESOS-6451
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added an hourly timer for `slave/docker_containerizer/pull`.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/docker.hpp 6bdf4c7f60384473927de437a061fafe94170557 
>   src/slave/containerizer/docker.cpp a4c9c10e91e75f406329cbb2086f39b549cbeed0 
> 
> 
> Diff: https://reviews.apache.org/r/53105/diff/5/
> 
> 
> Testing
> ---
> 
> Ran `mesos-execute` with a docker image and docker containerizer, observed 
> the following metrics:
> ```
> curl -s localhost:5051/metrics/snapshot | jq . | grep pull
>   "containerizer/docker/pull_ms/p999": 26209.36173824,
>   "containerizer/docker/pull_ms/p90": 21036.405248,
>   "containerizer/docker/pull_ms/p": 26256.388615424,
>   "containerizer/docker/pull_ms/p50": 135.570944,
>   "containerizer/docker/pull_ms/max": 26261.613824,
>   "containerizer/docker/pull_ms/p95": 23649.009536,
>   "containerizer/docker/pull_ms/min": 103.215872,
>   "containerizer/docker/pull_ms/p99": 25739.0929664,
>   "containerizer/docker/pull_ms": 26261.613824,
>   "containerizer/docker/pull_ms/count": 3
> ```
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 67575: Changed operator API to notify subscribers on every status change.

2018-06-15 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67575/#review204840
---




include/mesos/master/master.proto
Lines 605-606 (original), 605-609 (patched)
<https://reviews.apache.org/r/67575/#comment287580>

Please also update comments in `include/mesos/v1/master/master.proto`



include/mesos/master/master.proto
Lines 606 (patched)
<https://reviews.apache.org/r/67575/#comment287581>

```
// This is the latest state of the task according to the agent,
// which can be more recent...
```


- Zhitao Li


On June 15, 2018, 5:45 a.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67575/
> ---
> 
> (Updated June 15, 2018, 5:45 a.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Anand Mazumdar, Vinod Kone, 
> and Zhitao Li.
> 
> 
> Bugs: MESOS-9000
> https://issues.apache.org/jira/browse/MESOS-9000
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Prior to this change, the master would only send TaskUpdated messages
> to subscribers when the latest known task state on the agent changed.
> 
> This implied that schedulers could not reliably wait for the status
> information corresponding to specific state updates (i.e. TASK_RUNNING),
> since there is no guarantee that subscribers get notified during
> the time when this status update will be included in the status field.
> 
> After this change, TaskUpdate messages are sent whenever the latest
> acknowledged state of the task changes.
> 
> 
> Diffs
> -
> 
>   include/mesos/master/master.proto 54f84120728eea7995422b9c356ed67e5b054623 
>   src/master/master.cpp 5db5a8da85f02323a5654c93ac47ec4aa7e711d2 
> 
> 
> Diff: https://reviews.apache.org/r/67575/diff/1/
> 
> 
> Testing
> ---
> 
> `make check`
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 67423: Skipped metric for non existing paths in gc.

2018-06-12 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67423/
---

(Updated June 12, 2018, 1:23 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Repository: mesos


Description
---

Previously, agent gc would increment the "failed" counter if the path
does not exist, but this should not be an issue. This patch skipped such
paths in both "failed" and "succeeded" counters.


Diffs (updated)
-

  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 


Diff: https://reviews.apache.org/r/67423/diff/3/

Changes: https://reviews.apache.org/r/67423/diff/2-3/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 67421: Rewrote the `ROOT_BusyMountPoint` test to reflect updated behavior.

2018-06-12 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67421/
---

(Updated June 12, 2018, 1:23 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

The current `ROOT_BusyMountPoint` test would fail because we added
support for unmounting dangling mount points in directory to gc. This
patch rewrote this test to reflect that after unmounting, the gc
succeeded, directory was gone and metrics were correctly reported.


Diffs (updated)
-

  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 


Diff: https://reviews.apache.org/r/67421/diff/5/

Changes: https://reviews.apache.org/r/67421/diff/4-5/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 67264: Unmounted any mount points in gc paths.

2018-06-12 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
---

(Updated June 12, 2018, 1:22 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Jie's comments.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/8/

Changes: https://reviews.apache.org/r/67264/diff/7-8/


Testing
---

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside 
the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc 
the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, 
and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume'
 of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' 
inside garbage collected path 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li



Re: Review Request 67264: Unmounted any mount points in gc paths.

2018-06-11 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
---

(Updated June 11, 2018, 4:04 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Use adaptor::reverse


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/7/

Changes: https://reviews.apache.org/r/67264/diff/6-7/


Testing
---

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside 
the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc 
the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, 
and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume'
 of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' 
inside garbage collected path 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li



Re: Review Request 67264: Unmounted any mount points in gc paths.

2018-06-11 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
---

(Updated June 11, 2018, 1:58 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Fail `rmdir` future if mount table cannot be loaded for agent on Linux.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/6/

Changes: https://reviews.apache.org/r/67264/diff/5-6/


Testing
---

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside 
the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc 
the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, 
and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume'
 of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' 
inside garbage collected path 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li



Re: Review Request 67421: Rewrote the `ROOT_BusyMountPoint` test to reflect updated behavior.

2018-06-05 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67421/
---

(Updated June 5, 2018, 5:08 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Updated test cases.


Summary (updated)
-

Rewrote the `ROOT_BusyMountPoint` test to reflect updated behavior.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description (updated)
---

The current `ROOT_BusyMountPoint` test would fail because we added
support for unmounting dangling mount points in directory to gc. This
patch rewrote this test to reflect that after unmounting, the gc
succeeded, directory was gone and metrics were correctly reported.


Diffs (updated)
-

  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 


Diff: https://reviews.apache.org/r/67421/diff/3/

Changes: https://reviews.apache.org/r/67421/diff/2-3/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 67264: Unmounted any mount points in gc paths.

2018-06-05 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
---

(Updated June 5, 2018, 5:07 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Jie's comments.


Summary (updated)
-

Unmounted any mount points in gc paths.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description (updated)
---

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

Currently, the only mounts in the host mount namespace under the sandbox
directories are persistent volumes, so this diff added protection to
unmount any dangling mount points before calling `rmdir` on the
directory.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp 01eb0950e687227dac81b1cdb9eaba3379cf5dbb 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 3d67511de5abd3466eeb5ad1daf318209bd69eed 


Diff: https://reviews.apache.org/r/67264/diff/5/

Changes: https://reviews.apache.org/r/67264/diff/4-5/


Testing
---

Added a unit test in following patch.

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside 
the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc 
the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, 
and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume'
 of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' 
inside garbage collected path 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li



Re: Review Request 67421: Added a test for verifying GC can handle dangling persistent volume.

2018-06-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67421/
---

(Updated June 1, 2018, 10:31 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Fix test name and failed counter value


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

Added a test for verifying GC can handle dangling persistent volume.


Diffs (updated)
-

  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 


Diff: https://reviews.apache.org/r/67421/diff/2/

Changes: https://reviews.apache.org/r/67421/diff/1-2/


Testing
---


Thanks,

Zhitao Li



Review Request 67423: Skipped metric for non existing paths in gc.

2018-06-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67423/
---

Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Repository: mesos


Description
---

Previously, agent gc would increment the "failed" counter if the path
does not exist, but this should not be an issue. This patch skipped such
paths in both "failed" and "succeeded" counters.


Diffs
-

  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 


Diff: https://reviews.apache.org/r/67423/diff/1/


Testing
---


Thanks,

Zhitao Li



Review Request 67421: Added a test for verifying GC can handle dangling persistent volume.

2018-06-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67421/
---

Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

Added a test for verifying GC can handle dangling persistent volume.


Diffs
-

  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 


Diff: https://reviews.apache.org/r/67421/diff/1/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 67264: Unmounted any dangling persistent volume in gc paths.

2018-05-31 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
---

(Updated May 31, 2018, 1:38 p.m.)


Review request for mesos, Chun-Hung Hsiao, Jason Lai, and Jie Yu.


Changes
---

Jason's comments.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

This patch added some protection to unmount possible persistent
volumes inside a path to gc, and skipped the path if unmount failed.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp b56212f6529a4d307e65797ad9bb34f2104fc832 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 65d860594572b58a50a89358e31e97fd2a10bf08 


Diff: https://reviews.apache.org/r/67264/diff/3/

Changes: https://reviews.apache.org/r/67264/diff/2-3/


Testing
---

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside 
the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc 
the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, 
and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume'
 of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' 
inside garbage collected path 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li



Re: Review Request 67320: Sent task (health) check updates over the operator streaming API.

2018-05-29 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67320/#review204023
---


Ship it!




Ship It!

- Zhitao Li


On May 28, 2018, 4:25 a.m., Alexander Rukletsov wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67320/
> ---
> 
> (Updated May 28, 2018, 4:25 a.m.)
> 
> 
> Review request for mesos and Zhitao Li.
> 
> 
> Bugs: MESOS-8942
> https://issues.apache.org/jira/browse/MESOS-8942
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With this patch subscribers to the master operator streaming API
> start receiving task health (check) updates. This allows subscribers
> to maintain more accurate view of the cluster's state, closer to
> what the traditional `state.json` endpoint offers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 41862db9900acde85a62d2fea85459691c68556e 
>   src/tests/api_tests.cpp 84368707e2c0bcf66bbfb308a4b863112119d328 
> 
> 
> Diff: https://reviews.apache.org/r/67320/diff/2/
> 
> 
> Testing
> ---
> 
> make check on
> * Mac OS 10.13.4
> * various linux distros
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>



Re: Review Request 67320: Sent task (health) check updates over the operator streaming API.

2018-05-25 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67320/#review203885
---




src/tests/api_tests.cpp
Lines 2386 (patched)
<https://reviews.apache.org/r/67320/#comment286238>

We typically do not do `this->`.

`Try<Owned> master = StartMaster();`


- Zhitao Li


On May 25, 2018, 6:52 a.m., Alexander Rukletsov wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67320/
> ---
> 
> (Updated May 25, 2018, 6:52 a.m.)
> 
> 
> Review request for mesos and Zhitao Li.
> 
> 
> Bugs: MESOS-8942
> https://issues.apache.org/jira/browse/MESOS-8942
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> With this patch subscribers to the master operator streaming API
> start receiving task health (check) updates. This allows subscribers
> to maintain more accurate view of the cluster's state, closer to
> what the traditional `state.json` endpoint offers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 41862db9900acde85a62d2fea85459691c68556e 
>   src/tests/api_tests.cpp 84368707e2c0bcf66bbfb308a4b863112119d328 
> 
> 
> Diff: https://reviews.apache.org/r/67320/diff/1/
> 
> 
> Testing
> ---
> 
> make check on
> * Mac OS 10.13.4
> * various linux distros
> 
> 
> Thanks,
> 
> Alexander Rukletsov
> 
>



Re: Review Request 67264: Unmounted any dangling persistent volume in gc paths.

2018-05-24 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
---

(Updated May 24, 2018, 12:48 p.m.)


Review request for mesos, Jason Lai and Jie Yu.


Changes
---

Fix test compilation due to changed constructor interface.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

This patch added some protection to unmount possible persistent
volumes inside a path to gc, and skipped the path if unmount failed.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs (updated)
-

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 
  src/tests/cluster.cpp b56212f6529a4d307e65797ad9bb34f2104fc832 
  src/tests/gc_tests.cpp 619ed22edd9b3909ea24cdcbf62c354420a8d031 
  src/tests/mesos.hpp 733344a2f07ebd9d841a55fb9bbfda2e3c1a1eb2 
  src/tests/mesos.cpp d3c87c295429481c59d5a49398e289a4b84e4496 
  src/tests/slave_tests.cpp 65d860594572b58a50a89358e31e97fd2a10bf08 


Diff: https://reviews.apache.org/r/67264/diff/2/

Changes: https://reviews.apache.org/r/67264/diff/1-2/


Testing
---

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside 
the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc 
the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, 
and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume'
 of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' 
inside garbage collected path 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li



Review Request 67264: Unmounted any dangling persistent volume in gc paths.

2018-05-23 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67264/
---

Review request for mesos, Jason Lai and Jie Yu.


Bugs: MESOS-8830
https://issues.apache.org/jira/browse/MESOS-8830


Repository: mesos


Description
---

In various corner cases, agent may not get chance to properly unmount
persistent volumes mounted inside an executor's sandbox. When GC later
gets to these sandbox directories, permanent data loss can happen (see
MESOS-8830).

This patch added some protection to unmount possible persistent
volumes inside a path to gc, and skipped the path if unmount failed.

NOTE: this means agent will not garbage collect any path if it cannot
read its own `mountinfo` table.


Diffs
-

  src/local/local.cpp afff54653e8e659d947ddbee6dc38ba2715f2a78 
  src/slave/gc.hpp df40165bb8a23f065156bf6c5f354b143d88c088 
  src/slave/gc.cpp 390b35e6d17d6614a73c9548decbf10739560106 
  src/slave/gc_process.hpp 20374ad91820341282fdf18ecade60a020e26cea 
  src/slave/main.cpp 646125344d590b28256d8ee684d7e51a90e82f23 
  src/slave/paths.hpp 015896453410a33923eed07b3e676be19af62a48 
  src/slave/paths.cpp ed0b1276908f4990ce7a24c96aea20e8c79d3126 


Diff: https://reviews.apache.org/r/67264/diff/1/


Testing
---

Tested with following procedures:
1. Start a test master and agent;
2. Created a persistent volume on agent through operator API;
3. Use `mesos-execute` to run a task;
4. Stop the agent;
5. Manually bind mount persistent volume path into a `volume` directory inside 
the executor sandbox (to simulate a dangling mount in MESOS-8830);
6. Restart agent with `--gc_disk_headroom=1.0 --gc_delay=1secs` to force it gc 
the path immediately.

With this fix, we observed that the dangling mount is automatically cleaned up, 
and agent produces log line:
```
W0523 06:00:04.001075 82745 gc.cpp:229] Unmounting dangling mount point 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0/frameworks/b3eb3aff-d19d-45ff-8113-f0316462d3fa-/executors/test_id/runs/1cd3bd06-2632-4541-a708-80c7cd51c74b/volume'
 of persistent volume '/home/zhitao/mesos-workdir/volumes/roles/role/id1' 
inside garbage collected path 
'/home/zhitao/mesos-workdir/slaves/b3eb3aff-d19d-45ff-8113-f0316462d3fa-S0'
```


Thanks,

Zhitao Li



Re: Review Request 67154: Replaced `RpcResult` with `Try<Response, StatusError>`.

2018-05-16 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67154/#review203271
---




3rdparty/libprocess/include/process/grpc.hpp
Lines 122 (patched)
<https://reviews.apache.org/r/67154/#comment285376>

This function returns a `Future` of a `Try` ...



3rdparty/libprocess/include/process/grpc.hpp
Lines 125-126 (patched)
<https://reviews.apache.org/r/67154/#comment285377>

Do we need to call out what case the `Future` itself can be failed?


- Zhitao Li


On May 16, 2018, 12:20 p.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67154/
> ---
> 
> (Updated May 16, 2018, 12:20 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benjamin Mahler, Jie Yu, and 
> Zhitao Li.
> 
> 
> Bugs: MESOS-8924
> https://issues.apache.org/jira/browse/MESOS-8924
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `process::grpc::client::Runtime::call` method currently returns a
> `RpcResult`, which contains both a `::grpc::Status` object
> and the resulting response protobuf. However, if the `::grpc::Status`
> represents a non-OK status, the gRPC library does not guarantee that
> the response protobuf is valid. This patch replaces `RpcResult` with
> `Try` to provide better type safety.
> 
> 
> Diffs
> -
> 
>   3rdparty/libprocess/include/process/grpc.hpp 
> 321a46e19c69eafb24012bcef68bb8b0cc6aa436 
>   3rdparty/libprocess/src/tests/grpc_tests.cpp 
> 38cd6c61b54518a1019bb11a3551be13026c3f0d 
> 
> 
> Diff: https://reviews.apache.org/r/67154/diff/1/
> 
> 
> Testing
> ---
> 
> make check in libprocess
> 
> NOTE: Mesos cannot be built with this patch standalone. The tests are done 
> later in the chain.
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.

2018-05-08 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67022/
---

(Updated May 8, 2018, 5:40 p.m.)


Review request for mesos, Eric Chung and Jason Lai.


Repository: mesos


Description
---

Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.


Diffs (updated)
-

  src/slave/flags.cpp a319b5ea633c41fd8a252c5e1617ac52d1480ba5 
  src/slave/slave.cpp c6d9152d9de4184f9107bb8242b41d468d76e018 


Diff: https://reviews.apache.org/r/67022/diff/2/

Changes: https://reviews.apache.org/r/67022/diff/1-2/


Testing
---


Thanks,

Zhitao Li



Review Request 67022: Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.

2018-05-08 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67022/
---

Review request for mesos, Benno Evers, Jason Lai, and Vinod Kone.


Repository: mesos


Description
---

Refreshed checkpointed SlaveInfo if `--reconfiguration_policy==any`.


Diffs
-

  src/slave/slave.cpp c6d9152d9de4184f9107bb8242b41d468d76e018 


Diff: https://reviews.apache.org/r/67022/diff/1/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 64384: Added new 'any' setting for reconfiguration_policy flag.

2018-05-08 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64384/#review202713
---




src/slave/slave.cpp
Lines 6373-6375 (original), 6377-6379 (patched)
<https://reviews.apache.org/r/64384/#comment284699>

I noticed that we do not refresh the checkpointed information of agent, but 
simply proceed to run a different `AgentInfo` from what's left on the disk.

Do we worry that this could cause a confusion in the future? I wonder 
whether we should augment the behavior for `any` to also flush out changed 
`AgentInfo` using `state::checkpoint()`.

Thoughts?


- Zhitao Li


On Dec. 6, 2017, 8:22 a.m., Benno Evers wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64384/
> ---
> 
> (Updated Dec. 6, 2017, 8:22 a.m.)
> 
> 
> Review request for mesos.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This setting allows any state change, effectively telling agents to ignore 
> the existing state and not to run any new tasks until the existing ones fit 
> into the new provided resources.
> 
> 
> Diffs
> -
> 
>   src/slave/compatibility.hpp 78b421a01abe5d2178c93832577577a7ba282b38 
>   src/slave/compatibility.cpp 4ead4a5b655f6f3d7812aa52d656830d7cff4598 
>   src/slave/flags.cpp d8764745e6aca81283d8b96388df1320c3465952 
>   src/slave/slave.cpp 49270013537356c8fe9150d757b064bc3bbae3cb 
> 
> 
> Diff: https://reviews.apache.org/r/64384/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Benno Evers
> 
>



Re: Review Request 66993: Removed unnecessary expectation on termination.

2018-05-08 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66993/
---

(Updated May 8, 2018, 1:36 p.m.)


Review request for mesos, Andrei Budnik and James Peach.


Bugs: MESOS-8884
https://issues.apache.org/jira/browse/MESOS-8884


Repository: mesos


Description
---

This test was flaky because termination could already happened when we
set up the expectation. Given that we already verified task state, I do
not see checking container termination explicitly is necessary, so
removing the expectation should fix the flakiness.


Diffs (updated)
-

  src/tests/containerizer/docker_containerizer_tests.cpp 
d834e531a550028cd57ac409c9312dd22138e8d5 


Diff: https://reviews.apache.org/r/66993/diff/2/

Changes: https://reviews.apache.org/r/66993/diff/1-2/


Testing
---

GTEST_FILTER="DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime" 
./bin/mesos-tests.sh --gtest_repeat=100 --verbose


Thanks,

Zhitao Li



Review Request 66993: Removed unnecessary expectation on termination.

2018-05-07 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66993/
---

Review request for mesos, Andrei Budnik and James Peach.


Bugs: MESOS-8884
https://issues.apache.org/jira/browse/MESOS-8884


Repository: mesos


Description
---

This test was flaky because termination could already happened when we
set up the expectation. Given that we already verified task state, I do
not see checking container termination explicitly is necessary, so
removing the expectation should fix the flakiness.


Diffs
-

  src/tests/containerizer/docker_containerizer_tests.cpp 
d834e531a550028cd57ac409c9312dd22138e8d5 


Diff: https://reviews.apache.org/r/66993/diff/1/


Testing
---

GTEST_FILTER="DockerContainerizerTest.ROOT_DOCKER_MaxCompletionTime" 
./bin/mesos-tests.sh --gtest_repeat=100 --verbose


Thanks,

Zhitao Li



Re: Review Request 66923: Added documentation on volume resize support.

2018-05-04 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66923/
---

(Updated May 4, 2018, 9:49 a.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
---

Review comments.


Repository: mesos


Description
---

Added documentation on volume resize support.


Diffs (updated)
-

  docs/authorization.md fdbef770c38ab38ab748b4f943b331d4442a2ce1 
  docs/operator-http-api.md 9be1e2db60562b369bdc6ee2c732a74fad049580 
  docs/persistent-volume.md 1a5799b8d8e5302e0b01da6b9a16a3b9fb75c898 


Diff: https://reviews.apache.org/r/66923/diff/2/

Changes: https://reviews.apache.org/r/66923/diff/1-2/


Testing
---

https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/persistent-volume.md
https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/operator-http-api.md
https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/authorization.md


Thanks,

Zhitao Li



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-05-03 Thread Zhitao Li


> On May 2, 2018, 11:51 a.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 1289 (patched)
> > <https://reviews.apache.org/r/66532/diff/8/?file=2015784#file2015784line1289>
> >
> > I don't think we need this but it doesn't hurt. I'll leave it up to you 
> > to decide if you want to keep this. Please feel free to drop it if you 
> > prefer keeping it.
> > 
> > This is more of a personal preference ;) I prefer always introducing 
> > synchronizations for a reason but some others prefer making tests as 
> > deterministic as possible.

I think I prefer keeping it for now.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/#review202300
-------


On May 1, 2018, 3:55 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66532/
> ---
> 
> (Updated May 1, 2018, 3:55 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-8748
> https://issues.apache.org/jira/browse/MESOS-8748
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added test for authorization actions for `RESIZE_VOLUME`.
> 
> 
> Diffs
> -
> 
>   src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66532/diff/8/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66920: Improved tests for resizing persistent volumes.

2018-05-03 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66920/#review202355
---




src/tests/persistent_volume_tests.cpp
Lines 604 (patched)
<https://reviews.apache.org/r/66920/#comment284139>

Technically we only tested that persistent volume and its data can still be 
accessed.

I do not know a reliable and efficient way to test the volume size actually 
changed after resizing.

Maybe clarify the comment?



src/tests/persistent_volume_tests.cpp
Line 751 (original), 807 (patched)
<https://reviews.apache.org/r/66920/#comment284140>

```
This test verifies that launching any task depending either origial or 
grown volume of a `GROW_VOLUME` call in the same `acceptOffers` will be 
dropped...
```



src/tests/persistent_volume_tests.cpp
Line 874 (original), 942 (patched)
<https://reviews.apache.org/r/66920/#comment284141>

```
This test verifies that launching any task depending either origial or 
shrunk volume of a `SHRINK_VOLUME` call in the same `acceptOffers` will be 
dropped...
    ```


- Zhitao Li


On May 2, 2018, 9:54 p.m., Chun-Hung Hsiao wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66920/
> ---
> 
> (Updated May 2, 2018, 9:54 p.m.)
> 
> 
> Review request for mesos, Greg Mann and Zhitao Li.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Now the `GrowVolume` and `ShrinkVolume` tests launch tasks after
> resizing the volumes to ensure that the operations take effect on
> agents. The `NonSpeculativeGrowAndLaunch` and
> `NonSpeculativeShrinkAndLaunch` tests launch an additional task to
> verify that the original volume consumed by the operations cannot be
> used by subsequent tasks.
> 
> This patch also adjusted when the clock is resumed so schedulers will
> not receive unexpected offers.
> 
> 
> Diffs
> -
> 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66920/diff/2/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> 
> Thanks,
> 
> Chun-Hung Hsiao
> 
>



Re: Review Request 66227: Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API.

2018-05-02 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66227/
---

(Updated May 2, 2018, 5:06 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Run test on Windows too.


Bugs: MESOS-8747
https://issues.apache.org/jira/browse/MESOS-8747


Repository: mesos


Description
---

Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API.


Diffs (updated)
-

  src/tests/api_tests.cpp dd8e221d8fd1b2a241505345337897e4ee4a6347 


Diff: https://reviews.apache.org/r/66227/diff/9/

Changes: https://reviews.apache.org/r/66227/diff/8-9/


Testing
---


Thanks,

Zhitao Li



Review Request 66923: Added documentation on volume resize support.

2018-05-02 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66923/
---

Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Repository: mesos


Description
---

Added documentation on volume resize support.


Diffs
-

  docs/authorization.md fdbef770c38ab38ab748b4f943b331d4442a2ce1 
  docs/operator-http-api.md 9be1e2db60562b369bdc6ee2c732a74fad049580 
  docs/persistent-volume.md 1a5799b8d8e5302e0b01da6b9a16a3b9fb75c898 


Diff: https://reviews.apache.org/r/66923/diff/1/


Testing
---

https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/persistent-volume.md
https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/operator-http-api.md
https://github.com/zhitaoli/mesos/blob/zhitao/public/volume_resize_mesos_4965_speculative/docs/authorization.md


Thanks,

Zhitao Li



Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.

2018-05-02 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66858/
---

(Updated May 2, 2018, 2:16 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Bugs: MESOS-4945
https://issues.apache.org/jira/browse/MESOS-4945


Repository: mesos


Description
---

Added tests for validation of `GrowVolume` and `ShrinkVolume`.


Diffs (updated)
-

  src/tests/master_validation_tests.cpp 
a5229610a81dfac9d358597135b63ee51de86659 


Diff: https://reviews.apache.org/r/66858/diff/6/

Changes: https://reviews.apache.org/r/66858/diff/5-6/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.

2018-05-02 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66858/
---

(Updated May 2, 2018, 2:14 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Bugs: MESOS-4945
https://issues.apache.org/jira/browse/MESOS-4945


Repository: mesos


Description
---

Added tests for validation of `GrowVolume` and `ShrinkVolume`.


Diffs (updated)
-

  src/tests/master_validation_tests.cpp 
a5229610a81dfac9d358597135b63ee51de86659 


Diff: https://reviews.apache.org/r/66858/diff/5/

Changes: https://reviews.apache.org/r/66858/diff/4-5/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.

2018-05-02 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66858/
---

(Updated May 2, 2018, 12:01 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
---

Added test for grow on MOUNT.


Bugs: MESOS-4945
https://issues.apache.org/jira/browse/MESOS-4945


Repository: mesos


Description
---

Added tests for validation of `GrowVolume` and `ShrinkVolume`.


Diffs (updated)
-

  src/tests/master_validation_tests.cpp 
a5229610a81dfac9d358597135b63ee51de86659 


Diff: https://reviews.apache.org/r/66858/diff/4/

Changes: https://reviews.apache.org/r/66858/diff/3-4/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66227: Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API.

2018-05-02 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66227/
---

(Updated May 2, 2018, 10:28 a.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Review comments.


Bugs: MESOS-8747
https://issues.apache.org/jira/browse/MESOS-8747


Repository: mesos


Description
---

Added test for `GROW_VOLUME` and `SHRINK_VOLUME` operator API.


Diffs (updated)
-

  src/tests/api_tests.cpp dd8e221d8fd1b2a241505345337897e4ee4a6347 


Diff: https://reviews.apache.org/r/66227/diff/7/

Changes: https://reviews.apache.org/r/66227/diff/6-7/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.

2018-05-02 Thread Zhitao Li


> On May 1, 2018, 7:18 p.m., Chun-Hung Hsiao wrote:
> > src/tests/master_validation_tests.cpp
> > Lines 1657 (patched)
> > <https://reviews.apache.org/r/66858/diff/2/?file=2015777#file2015777line1657>
> >
> > Hmm... we don't have this check when validating `GrowVolume` because we 
> > are counting on the fact that adding anything disk to a MOUNT disk will 
> > result in two resources. But I guess for the validation test maybe it's a 
> > good idea to also test this for `GrowVolume`. What do you think?

MOUNT would actually pass here in current implementation. As we discussed, I 
think it may be a good idea to explicitly validate this then test it out.

Let me know what you think.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66858/#review202251
-------


On May 2, 2018, 10:24 a.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66858/
> ---
> 
> (Updated May 2, 2018, 10:24 a.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-4945
> https://issues.apache.org/jira/browse/MESOS-4945
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added tests for validation of `GrowVolume` and `ShrinkVolume`.
> 
> 
> Diffs
> -
> 
>   src/tests/master_validation_tests.cpp 
> a5229610a81dfac9d358597135b63ee51de86659 
> 
> 
> Diff: https://reviews.apache.org/r/66858/diff/3/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.

2018-05-02 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66858/
---

(Updated May 2, 2018, 10:24 a.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
---

Rebase and review comments.


Bugs: MESOS-4945
https://issues.apache.org/jira/browse/MESOS-4945


Repository: mesos


Description
---

Added tests for validation of `GrowVolume` and `ShrinkVolume`.


Diffs (updated)
-

  src/tests/master_validation_tests.cpp 
a5229610a81dfac9d358597135b63ee51de86659 


Diff: https://reviews.apache.org/r/66858/diff/3/

Changes: https://reviews.apache.org/r/66858/diff/2-3/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66051: Implemented operator API to grow and shrink persistent volume.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66051/
---

(Updated May 1, 2018, 3:56 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Review comments.


Bugs: MESOS-8747
https://issues.apache.org/jira/browse/MESOS-8747


Repository: mesos


Description
---

These operator APIs is implemented as speculative for now, but
we plan to convert them to non-speculative in the future.


Diffs (updated)
-

  src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad 
  src/master/master.hpp c30cf082d9ddf2e5dd97e0d887d6d07d8e03cbd4 
  src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d 


Diff: https://reviews.apache.org/r/66051/diff/16/

Changes: https://reviews.apache.org/r/66051/diff/15-16/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/#review202238
---




src/tests/persistent_volume_tests.cpp
Line 1212 (original), 1133 (patched)
<https://reviews.apache.org/r/66532/#comment283990>

For my self: consistency.


- Zhitao Li


On May 1, 2018, 3:45 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66532/
> ---
> 
> (Updated May 1, 2018, 3:45 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-8748
> https://issues.apache.org/jira/browse/MESOS-8748
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added test for authorization actions for `RESIZE_VOLUME`.
> 
> 
> Diffs
> -
> 
>   src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66532/diff/8/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/
---

(Updated May 1, 2018, 3:55 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
---

Several more review comments.


Bugs: MESOS-8748
https://issues.apache.org/jira/browse/MESOS-8748


Repository: mesos


Description
---

Added test for authorization actions for `RESIZE_VOLUME`.


Diffs (updated)
-

  src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 


Diff: https://reviews.apache.org/r/66532/diff/8/

Changes: https://reviews.apache.org/r/66532/diff/7-8/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/
---

(Updated May 1, 2018, 3:44 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Several more review comments.


Bugs: MESOS-4965
https://issues.apache.org/jira/browse/MESOS-4965


Repository: mesos


Description
---

Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.


Diffs (updated)
-

  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 


Diff: https://reviews.apache.org/r/66220/diff/10/

Changes: https://reviews.apache.org/r/66220/diff/9-10/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/
---

(Updated May 1, 2018, 3:45 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
---

Review comments.


Bugs: MESOS-8748
https://issues.apache.org/jira/browse/MESOS-8748


Repository: mesos


Description
---

Added test for authorization actions for `RESIZE_VOLUME`.


Diffs (updated)
-

  src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 


Diff: https://reviews.apache.org/r/66532/diff/7/

Changes: https://reviews.apache.org/r/66532/diff/6-7/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/
---

(Updated May 1, 2018, 3:37 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Review comments.


Bugs: MESOS-4965
https://issues.apache.org/jira/browse/MESOS-4965


Repository: mesos


Description
---

Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.


Diffs (updated)
-

  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 


Diff: https://reviews.apache.org/r/66220/diff/9/

Changes: https://reviews.apache.org/r/66220/diff/8-9/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66050: Implemented grow and shrink of persistent volumes.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66050/
---

(Updated May 1, 2018, 3:35 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Review comments.


Bugs: MESOS-4965
https://issues.apache.org/jira/browse/MESOS-4965


Repository: mesos


Description
---

The new offer operations are implemented as speculative operations, but
we will use validation to make them non-speculative on API level so that
we can transition later without a breaking change.


Diffs (updated)
-

  src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b 
  src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 
  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf 
  src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d 


Diff: https://reviews.apache.org/r/66050/diff/15/

Changes: https://reviews.apache.org/r/66050/diff/14-15/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-05-01 Thread Zhitao Li


> On April 30, 2018, 9:05 p.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 1256 (patched)
> > <https://reviews.apache.org/r/66532/diff/6/?file=2014384#file2014384line1256>
> >
> > Suggestion: Start the first framework with `DEFAULT_CREDENTIAL`, to be 
> > consistent with your comment for the second framework.

This is implicit by using `DEFAULT_FRAMEWORK_INFO`. Are you suggesting to call 
set_principal() explicitly?


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/#review202170
-----------


On April 27, 2018, 1:08 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66532/
> ---
> 
> (Updated April 27, 2018, 1:08 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-8748
> https://issues.apache.org/jira/browse/MESOS-8748
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added test for authorization actions for `RESIZE_VOLUME`.
> 
> 
> Diffs
> -
> 
>   src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66532/diff/6/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-05-01 Thread Zhitao Li


> On April 30, 2018, 9:05 p.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 1464 (patched)
> > <https://reviews.apache.org/r/66532/diff/6/?file=2014384#file2014384line1464>
> >
> > No need to resume the clock here, as the fixture teardown will resume 
> > it. However, I'm not sure if you need to resume the clock before stopping 
> > and joining `driver2`.

I think enough tests explicitly resume the clock, so I'd like to keep it unless 
you have issue with it.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/#review202170
-------


On April 27, 2018, 1:08 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66532/
> ---
> 
> (Updated April 27, 2018, 1:08 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-8748
> https://issues.apache.org/jira/browse/MESOS-8748
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added test for authorization actions for `RESIZE_VOLUME`.
> 
> 
> Diffs
> -
> 
>   src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66532/diff/6/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-05-01 Thread Zhitao Li


> On April 30, 2018, 5:20 p.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 459 (patched)
> > <https://reviews.apache.org/r/66220/diff/8/?file=2014377#file2014377line459>
> >
> > Suggestion for the TODO: Make `MOUNT` a meaningful parameter value for 
> > this test, or create a new fixture to avoid testing against it.
> > 
> > BTW, would it be hard to do what you did for `MOUNT` in the 
> > `ShrinkVolume` test?

Yes it is harder here because there is not conceivable way for a framework to 
receive both a `MOUNT` volume and free disk resource on the same `MOUNT`, so 
framework cannot even construct such operation from offered resources.


> On April 30, 2018, 5:20 p.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 837 (patched)
> > <https://reviews.apache.org/r/66220/diff/8/?file=2014377#file2014377line837>
> >
> > To make this test more succinct, how about doing
> > `{CREATE(volume), GROW_VOLUME(volume, addition), LAUNCH(task)}` so we 
> > don't need the subsequent `acceptOffers`?

I like this suggestion. will try it out.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/#review202157
---


On April 27, 2018, 1:04 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66220/
> ---
> 
> (Updated April 27, 2018, 1:04 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.
> 
> 
> Bugs: MESOS-4965
> https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> 
> 
> Diffs
> -
> 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66220/diff/8/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66293: Tested default executor support of `max_completion_time`.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66293/
---

(Updated May 1, 2018, 9:43 a.m.)


Review request for mesos, Jason Lai and James Peach.


Changes
---

Review comments.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

Tested default executor support of `max_completion_time`.


Diffs (updated)
-

  src/tests/default_executor_tests.cpp bf849c4b636e81ec267112bff9621579998941f5 


Diff: https://reviews.apache.org/r/66293/diff/6/

Changes: https://reviews.apache.org/r/66293/diff/5-6/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66284: Tested `max_completion_time` support in docker executor.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66284/
---

(Updated May 1, 2018, 9:42 a.m.)


Review request for mesos, Jason Lai and James Peach.


Changes
---

Review comments.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

Tested `max_completion_time` support in docker executor.


Diffs (updated)
-

  src/tests/containerizer/docker_containerizer_tests.cpp 
5e3dfdb537f52a85a82c6e8195e70a389b76aadf 


Diff: https://reviews.apache.org/r/66284/diff/5/

Changes: https://reviews.apache.org/r/66284/diff/4-5/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66291: Added support to `max_completion_time` in default executor.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66291/
---

(Updated May 1, 2018, 9:42 a.m.)


Review request for mesos, Jason Lai and James Peach.


Changes
---

Review comments.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

When a task group has multiple tasks:
- each task can have its own `max_completion_time`, or not have one;
- if a task succeeds before its `max_completion_time`, all other tasks
will keep running;
- if a task fails, all other tasks in the same group will fail (as
before);
- if a task does not succeed before its `max_completion_time`, it will
fail with `TASK_FAILED` and reason `REASON_MAX_COMPLETION_TIME_REACHED`,
while other tasks will be killed without above reason.


Diffs (updated)
-

  src/launcher/default_executor.cpp ea0d425bb60e970f209f411632e1d486c279259b 


Diff: https://reviews.apache.org/r/66291/diff/4/

Changes: https://reviews.apache.org/r/66291/diff/3-4/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66283: Added support of `max_completion_time` in docker executor.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66283/
---

(Updated May 1, 2018, 9:42 a.m.)


Review request for mesos, Jason Lai and James Peach.


Changes
---

Review comments.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

If `TaskInfo.max_completion_time` is set, docker executor will kill
the task immediately. We reuse the `shutdown` method to achieve a
forced kill ignoring any `KillPolicy`.

Framework should only received a `TASK_FAILED` state with
`REASON_MAX_COMPLETION_TIME_REACHED` reason.


Diffs (updated)
-

  src/docker/executor.cpp 1d672112bae889cc9d19343a59e0ff66569785c4 


Diff: https://reviews.apache.org/r/66283/diff/5/

Changes: https://reviews.apache.org/r/66283/diff/4-5/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66259: Added `max_completion_time` support to command executor.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66259/
---

(Updated May 1, 2018, 9:41 a.m.)


Review request for mesos, Jason Lai and James Peach.


Changes
---

Review comments.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

If `TaskInfo.max_completion_time` is set, command executor will kill
the task with `SIGKILL` immediately. Note that no KillPolicy will be
observed. Framework should only received a `TASK_FAILED` state with
`REASON_MAX_COMPLETION_TIME_REACHED` reason.


Diffs (updated)
-

  src/launcher/executor.cpp 8d0869cfcdfc693301d0e185a036e97204bad17f 


Diff: https://reviews.apache.org/r/66259/diff/6/

Changes: https://reviews.apache.org/r/66259/diff/5-6/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66828: Introduced a push-based gauge metric.

2018-05-01 Thread Zhitao Li


> On May 1, 2018, 9:25 a.m., Zhitao Li wrote:
> > 3rdparty/libprocess/include/process/metrics/push_gauge.hpp
> > Lines 70-75 (patched)
> > <https://reviews.apache.org/r/66828/diff/1/?file=2013207#file2013207line70>
> >
> > Hmm, is there a possible race condition here?
> > 
> > Consider two threads calling this with `v=2` and `v=3`, and initial 
> > value as `0`, and the order would be:
> > 
> > T1: `fetch_add(2)`, `prev` in T1 == 0;
> > T2: `fetch_add(3)`, `prev` in T2 == 2;
> > T2: `push(2 + 3)`, metric value is 5;
> > T1: `push(0 + 2)`, metric value is 2.
> > 
> > Metric value and cached value in gauge would be out of sync due to the 
> > race condition?

My bad, realized that `push()` is only used as history. Actual sampling happens 
through `value()` function. Please ignore my comment.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66828/#review202194
---


On April 26, 2018, 2:54 p.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66828/
> ---
> 
> (Updated April 26, 2018, 2:54 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Gilbert Song, Greg Mann, James 
> Peach, and Vinod Kone.
> 
> 
> Bugs: MESOS-8851
> https://issues.apache.org/jira/browse/MESOS-8851
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A push-based gauge differs from a pull-based gauge in that the
> client is responsible for pushing the latest value into the gauge
> whenever it changes. This can be challenging in some cases as it
> requires the client to have a good handle on when the gauge value
> changes (rather than just computing the current value when asked).
> 
> It is highly recommended to use push-based gauges if possible as
> they provide significant performance benefits over pull-based
> gauges. Pull-based gauge suffer from delays getting processed on
> the event queue of a `Process`, as well as incur computation cost
> on the `Process` each time the metrics are collected. Push-based
> gauges, on the other hand, incur no cost to the owning `Process`
> when metrics are collected, and instead incur a trivial cost when
> the `Process` pushes new values in.
> 
> 
> Diffs
> -
> 
>   3rdparty/libprocess/include/Makefile.am 
> cd2c3bc62df8de5b50ec2fa830b3e2634ba11e28 
>   3rdparty/libprocess/include/process/metrics/push_gauge.hpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/66828/diff/1/
> 
> 
> Testing
> ---
> 
> Test added in the subsequent patch.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>



Re: Review Request 66828: Introduced a push-based gauge metric.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66828/#review202194
---




3rdparty/libprocess/include/process/metrics/push_gauge.hpp
Lines 70-75 (patched)
<https://reviews.apache.org/r/66828/#comment283932>

Hmm, is there a possible race condition here?

Consider two threads calling this with `v=2` and `v=3`, and initial value 
as `0`, and the order would be:

T1: `fetch_add(2)`, `prev` in T1 == 0;
T2: `fetch_add(3)`, `prev` in T2 == 2;
T2: `push(2 + 3)`, metric value is 5;
T1: `push(0 + 2)`, metric value is 2.

Metric value and cached value in gauge would be out of sync due to the race 
condition?


- Zhitao Li


On April 26, 2018, 2:54 p.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66828/
> ---
> 
> (Updated April 26, 2018, 2:54 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Gilbert Song, Greg Mann, James 
> Peach, and Vinod Kone.
> 
> 
> Bugs: MESOS-8851
> https://issues.apache.org/jira/browse/MESOS-8851
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A push-based gauge differs from a pull-based gauge in that the
> client is responsible for pushing the latest value into the gauge
> whenever it changes. This can be challenging in some cases as it
> requires the client to have a good handle on when the gauge value
> changes (rather than just computing the current value when asked).
> 
> It is highly recommended to use push-based gauges if possible as
> they provide significant performance benefits over pull-based
> gauges. Pull-based gauge suffer from delays getting processed on
> the event queue of a `Process`, as well as incur computation cost
> on the `Process` each time the metrics are collected. Push-based
> gauges, on the other hand, incur no cost to the owning `Process`
> when metrics are collected, and instead incur a trivial cost when
> the `Process` pushes new values in.
> 
> 
> Diffs
> -
> 
>   3rdparty/libprocess/include/Makefile.am 
> cd2c3bc62df8de5b50ec2fa830b3e2634ba11e28 
>   3rdparty/libprocess/include/process/metrics/push_gauge.hpp PRE-CREATION 
> 
> 
> Diff: https://reviews.apache.org/r/66828/diff/1/
> 
> 
> Testing
> ---
> 
> Test added in the subsequent patch.
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>



Re: Review Request 54986: Added pull gauges for slave message queue.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54986/
---

(Updated May 1, 2018, 8:54 a.m.)


Review request for mesos, Eric Chung, Jason Lai, and James Peach.


Changes
---

Rebase and use PullGauge.


Summary (updated)
-

Added pull gauges for slave message queue.


Bugs: MESOS-6831
https://issues.apache.org/jira/browse/MESOS-6831


Repository: mesos


Description (updated)
---

Added pull gauges for slave message queue.


Diffs (updated)
-

  src/slave/metrics.hpp 80b47222e05997b23041cc51037824a4cef3ec02 
  src/slave/metrics.cpp a1d01edea108ddecb92030fbd7f2e25f0fd143f0 
  src/slave/slave.hpp fb911efe4985c7bf3b340f29a5d884d476307705 
  src/tests/slave_tests.cpp ae82438a4f2f79820fe6d476e25b4915aa5f212c 


Diff: https://reviews.apache.org/r/54986/diff/3/

Changes: https://reviews.apache.org/r/54986/diff/2-3/


Testing
---

make check.


Thanks,

Zhitao Li



Re: Review Request 53267: Added a PullGauge to track active subscribers.

2018-05-01 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53267/
---

(Updated May 1, 2018, 8:54 a.m.)


Review request for mesos, Anand Mazumdar, Jason Lai, and James Peach.


Changes
---

Rebase and use PullGauge.


Summary (updated)
-

Added a PullGauge to track active subscribers.


Bugs: MESOS-6499
https://issues.apache.org/jira/browse/MESOS-6499


Repository: mesos


Description (updated)
---

Added a PullGauge to track active subscribers.


Diffs (updated)
-

  src/master/master.hpp c30cf082d9ddf2e5dd97e0d887d6d07d8e03cbd4 
  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/master/metrics.hpp ec76dbcd1d1fa5349d62ce73fb9603e1986a776b 
  src/master/metrics.cpp e46ead79f3f29e285426f9d061337077f453aa45 


Diff: https://reviews.apache.org/r/53267/diff/4/

Changes: https://reviews.apache.org/r/53267/diff/3-4/


Testing
---

Ran this on master with a client keeps subscribing and disconnecting. Observes 
that gauge gets updated and new log line is printed.


Thanks,

Zhitao Li



Re: Review Request 66875: Added an hourly timer for docker store pull latency.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66875/
---

(Updated April 30, 2018, 4:32 p.m.)


Review request for mesos, Gilbert Song, Jason Lai, and Jie Yu.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description
---

This patch added pull latency tracking for docker store, which can tell
us both latency distribution of pull as well as number of pulls.


Diffs
-

  src/slave/containerizer/mesos/provisioner/docker/store.cpp 
8b3f07f5027cb90d4b4ed401960494709d3eda5f 


Diff: https://reviews.apache.org/r/66875/diff/1/


Testing (updated)
---

Ran agent in command line and trigger several launches through `mesos-execute`, 
observed following metrics from agent endpoint:

```
  "containerizer/mesos/docker_store/pull_ms": 4084.254208,
  "containerizer/mesos/docker_store/pull_ms/count": 2,
  "containerizer/mesos/docker_store/pull_ms/max": 4084.254208,
  "containerizer/mesos/docker_store/pull_ms/min": 3525.044736,
  "containerizer/mesos/docker_store/pull_ms/p50": 3804.649472,
  "containerizer/mesos/docker_store/pull_ms/p90": 4028.3332608,
  "containerizer/mesos/docker_store/pull_ms/p95": 4056.2937344,
  "containerizer/mesos/docker_store/pull_ms/p99": 4078.66211328,
  "containerizer/mesos/docker_store/pull_ms/p999": 4083.694998528,
  "containerizer/mesos/docker_store/pull_ms/p": 4084.1982870528,
```


Thanks,

Zhitao Li



Re: Review Request 66292: Validated that all tasks in the same group have same max_duration.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66292/
---

(Updated April 30, 2018, 3:51 p.m.)


Review request for mesos, Jason Lai and James Peach.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

Validated that all tasks in the same group have same max_duration.


Diffs
-

  src/master/validation.cpp 9254cd22e67d9aca53135701575fb0857ff6f7d6 


Diff: https://reviews.apache.org/r/66292/diff/1/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66875: Added an hourly timer for docker store pull latency.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66875/
---

(Updated April 30, 2018, 3:36 p.m.)


Review request for mesos, Gilbert Song, Jason Lai, and Jie Yu.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description
---

This patch added pull latency tracking for docker store, which can tell
us both latency distribution of pull as well as number of pulls.


Diffs
-

  src/slave/containerizer/mesos/provisioner/docker/store.cpp 
8b3f07f5027cb90d4b4ed401960494709d3eda5f 


Diff: https://reviews.apache.org/r/66875/diff/1/


Testing
---


Thanks,

Zhitao Li



Review Request 66875: Added an hourly timer for docker store pull latency.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66875/
---

Review request for mesos, Gilbert Song, Jason Lai, and Jie Yu.


Repository: mesos


Description
---

This patch added pull latency tracking for docker store, which can tell
us both latency distribution of pull as well as number of pulls.


Diffs
-

  src/slave/containerizer/mesos/provisioner/docker/store.cpp 
8b3f07f5027cb90d4b4ed401960494709d3eda5f 


Diff: https://reviews.apache.org/r/66875/diff/1/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 53105: Added an hourly timer for `slave/docker_containerizer/pull`.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53105/
---

(Updated April 30, 2018, 3:35 p.m.)


Review request for mesos, Xiaojian Huang, Gilbert Song, Jason Lai, and Jie Yu.


Changes
---

Rebase to revive the patch.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description
---

Added an hourly timer for `slave/docker_containerizer/pull`.


Diffs (updated)
-

  src/slave/containerizer/docker.hpp 6bdf4c7f60384473927de437a061fafe94170557 
  src/slave/containerizer/docker.cpp a4c9c10e91e75f406329cbb2086f39b549cbeed0 


Diff: https://reviews.apache.org/r/53105/diff/5/

Changes: https://reviews.apache.org/r/53105/diff/4-5/


Testing
---

Ran `mesos-execute` with a docker image and docker containerizer, observed the 
following metrics:
```
curl -s localhost:5051/metrics/snapshot | jq . | grep pull
  "containerizer/docker/pull_ms/p999": 26209.36173824,
  "containerizer/docker/pull_ms/p90": 21036.405248,
  "containerizer/docker/pull_ms/p": 26256.388615424,
  "containerizer/docker/pull_ms/p50": 135.570944,
  "containerizer/docker/pull_ms/max": 26261.613824,
  "containerizer/docker/pull_ms/p95": 23649.009536,
  "containerizer/docker/pull_ms/min": 103.215872,
  "containerizer/docker/pull_ms/p99": 25739.0929664,
  "containerizer/docker/pull_ms": 26261.613824,
  "containerizer/docker/pull_ms/count": 3
```


Thanks,

Zhitao Li



Re: Review Request 53330: Tracked layers and pull latency in docker store.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53330/
---

(Updated April 30, 2018, 2:23 p.m.)


Review request for mesos, Xiaojian Huang, Gilbert Song, Jason Lai, and Jie Yu.


Bugs: MESOS-6451
https://issues.apache.org/jira/browse/MESOS-6451


Repository: mesos


Description
---

This patch added several metrics for docker store:
- a counter for number of layers pulled
- a gauge for total number of layers
- a timer for pull latency distribution in the last hour


Diffs
-

  src/slave/containerizer/mesos/provisioner/docker/paths.hpp 
232c027f8f96da0cb30b957bce4607d3695050d2 
  src/slave/containerizer/mesos/provisioner/docker/paths.cpp 
cd684b33eb308ce1eeb4539a5b2d51985d835db7 
  src/slave/containerizer/mesos/provisioner/docker/store.cpp 
68ce265a5d0a61f8d9ed55dd14c630dcf893a7d2 


Diff: https://reviews.apache.org/r/53330/diff/4/


Testing
---

Use mesos-execute to pull several different images, and verified following 
counters in agent's metrics/snapshot:
```
curl localhost:5051/metrics/snapshot | jq . | grep 
containerizer/mesos/docker_pull
  "containerizer/mesos/docker_store/layers_pulled": 47,
  "containerizer/mesos/docker_store/pull_ms/p50": 7080.760064,
  "containerizer/mesos/docker_store/pull_ms/p90": 11653.6390912,
  "containerizer/mesos/docker_store/pull_ms/p": 12528.3066648832,
  "containerizer/mesos/docker_store/pull_ms": 4550.814976,
  "containerizer/mesos/docker_store/pull_ms/count": 4,
  "containerizer/mesos/docker_store/pull_ms/max": 12529.182208,
  "containerizer/mesos/docker_store/pull_ms/p999": 12520.426776832,
  "containerizer/mesos/docker_store/pull_ms/min": 3167.337984,
  "containerizer/mesos/docker_store/pull_ms/p95": 12091.4106496,
  "containerizer/mesos/docker_store/pull_ms/p99": 12441.62789632,
  "containerizer/mesos/docker_store/layers": 47,
```


Thanks,

Zhitao Li



Re: Review Request 54987: Updated `docs/monitoring/md` for new agent event queue metrics.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54987/
---

(Updated April 30, 2018, 2:09 p.m.)


Review request for mesos, Anand Mazumdar, Xiaojian Huang, and Jason Lai.


Changes
---

Rebase.


Bugs: MESOS-6831
https://issues.apache.org/jira/browse/MESOS-6831


Repository: mesos


Description
---

Updated `docs/monitoring/md` for new agent event queue metrics.


Diffs (updated)
-

  docs/monitoring.md c20b99eb32a50830ea6f66b5bdbfcabe45fc6ecb 


Diff: https://reviews.apache.org/r/54987/diff/2/

Changes: https://reviews.apache.org/r/54987/diff/1-2/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 54986: Added metric for slave message queue.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54986/
---

(Updated April 30, 2018, 2:08 p.m.)


Review request for mesos, Eric Chung, Jason Lai, and James Peach.


Changes
---

Rebase.


Bugs: MESOS-6831
https://issues.apache.org/jira/browse/MESOS-6831


Repository: mesos


Description
---

Added metric for slave message queue.


Diffs (updated)
-

  src/slave/metrics.hpp b771c4b54efb6a2ea841e99c16e098b33302ca89 
  src/slave/metrics.cpp 44294af7cac29e462330d94217eed9f767972f0e 
  src/slave/slave.hpp c35996bdd564075b9b8f1467e13e24a6531d022e 
  src/tests/slave_tests.cpp ae82438a4f2f79820fe6d476e25b4915aa5f212c 


Diff: https://reviews.apache.org/r/54986/diff/2/

Changes: https://reviews.apache.org/r/54986/diff/1-2/


Testing
---

make check.


Thanks,

Zhitao Li



Re: Review Request 54986: Added metric for slave message queue.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/54986/
---

(Updated April 30, 2018, 2:01 p.m.)


Review request for mesos, Anand Mazumdar, Xiaojian Huang, and Jason Lai.


Bugs: MESOS-6831
https://issues.apache.org/jira/browse/MESOS-6831


Repository: mesos


Description
---

Added metric for slave message queue.


Diffs
-

  src/slave/metrics.hpp 3fc933ca65690d6fad63156398ad9c2c53789296 
  src/slave/metrics.cpp 0eb2b59ed67e14e73b29d7592c239441df0008d5 
  src/slave/slave.hpp be72ba93d4c6ca3e69e3f76f87aa7ddebdbf9cad 
  src/tests/slave_tests.cpp fc6b56c074c71b827a9ee522cd715c0d15ecc7e3 


Diff: https://reviews.apache.org/r/54986/diff/1/


Testing
---

make check.


Thanks,

Zhitao Li



Review Request 66871: Added documentation about new gauge on `master/subscribers_active`.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66871/
---

Review request for mesos, Jason Lai and James Peach.


Bugs: MESOS-6499
https://issues.apache.org/jira/browse/MESOS-6499


Repository: mesos


Description
---

Added documentation about new gauge on `master/subscribers_active`.


Diffs
-

  docs/monitoring.md c20b99eb32a50830ea6f66b5bdbfcabe45fc6ecb 


Diff: https://reviews.apache.org/r/66871/diff/1/


Testing
---

https://github.com/zhitaoli/mesos/blob/zhitao/public/active_subscribers/docs/monitoring.md#http-api


Thanks,

Zhitao Li



Re: Review Request 53267: Added a gauge to track active subscribers.

2018-04-30 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/53267/
---

(Updated April 30, 2018, 1:53 p.m.)


Review request for mesos, Anand Mazumdar, Jason Lai, and James Peach.


Changes
---

Rebase and move.


Bugs: MESOS-6499
https://issues.apache.org/jira/browse/MESOS-6499


Repository: mesos


Description
---

Added a gauge to track active subscribers.


Diffs (updated)
-

  src/master/master.hpp a7cadd9c97add92a0276bf64e0da941cae63fd7c 
  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/master/metrics.hpp 5414c4790f4606f910b0de81cbb08d398a0f3745 
  src/master/metrics.cpp 4cc96a1be532d6c7333a9fcdf8739d7640180777 


Diff: https://reviews.apache.org/r/53267/diff/3/

Changes: https://reviews.apache.org/r/53267/diff/2-3/


Testing
---

Ran this on master with a client keeps subscribing and disconnecting. Observes 
that gauge gets updated and new log line is printed.


Thanks,

Zhitao Li



Re: Review Request 66050: Implemented grow and shrink of persistent volumes.

2018-04-27 Thread Zhitao Li


> On April 24, 2018, 9:05 p.m., Chun-Hung Hsiao wrote:
> > src/master/validation.cpp
> > Lines 2332 (patched)
> > <https://reviews.apache.org/r/66050/diff/13/?file=2011231#file2011231line2332>
> >
> > Can you add tests for this function in 
> > `src/tests/master_validation_tests.cpp`? To unblock this patch, it can be 
> > done in a follow-up one.

Done in https://reviews.apache.org/r/66858/.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66050/#review201887
-------


On April 27, 2018, 12:24 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66050/
> ---
> 
> (Updated April 27, 2018, 12:24 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.
> 
> 
> Bugs: MESOS-4965
> https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The new offer operations are implemented as speculative operations, but
> we will use validation to make them non-speculative on API level so that
> we can transition later without a breaking change.
> 
> 
> Diffs
> -
> 
>   src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b 
>   src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 
>   src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
>   src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf 
>   src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d 
> 
> 
> Diff: https://reviews.apache.org/r/66050/diff/14/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Review Request 66858: Added tests for validation of `GrowVolume` and `ShrinkVolume`.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66858/
---

Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Bugs: MESOS-4945
https://issues.apache.org/jira/browse/MESOS-4945


Repository: mesos


Description
---

Added tests for validation of `GrowVolume` and `ShrinkVolume`.


Diffs
-

  src/tests/master_validation_tests.cpp 
a5229610a81dfac9d358597135b63ee51de86659 


Diff: https://reviews.apache.org/r/66858/diff/1/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66531: Added new authorization for `ResizeVolume`.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66531/
---

(Updated April 27, 2018, 1:14 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Bugs: MESOS-8748
https://issues.apache.org/jira/browse/MESOS-8748


Repository: mesos


Description
---

The new authorization action is modelled after `CreateVolume`, and will
be shared by both `GROW_VOLUME` and `SHRINK_VOLUME` operations and
corresponding operator APIs.


Diffs (updated)
-

  include/mesos/authorizer/acls.proto 8ef33210644e7d2924b402a3158b1b38c1fdb464 
  include/mesos/authorizer/authorizer.proto 
1508c01130013d74ed2b2574a2428f38e3d2c064 
  src/authorizer/local/authorizer.cpp c098ba9ded1b29a2e079cf09ab80b61f6fa4af05 
  src/master/master.hpp a7cadd9c97add92a0276bf64e0da941cae63fd7c 
  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 


Diff: https://reviews.apache.org/r/66531/diff/5/

Changes: https://reviews.apache.org/r/66531/diff/4-5/


Testing
---

Manually created ACL and verified that untrusted principals will not allow to 
grow/shrink volumes.
Also created unit test in next patch.


Thanks,

Zhitao Li



Re: Review Request 66051: Implemented operator API to grow and shrink persistent volume.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66051/
---

(Updated April 27, 2018, 1:09 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Updated after validation change.


Bugs: MESOS-8747
https://issues.apache.org/jira/browse/MESOS-8747


Repository: mesos


Description
---

These operator APIs is implemented as speculative for now, but
we plan to convert them to non-speculative in the future.


Diffs (updated)
-

  src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad 
  src/master/master.hpp a7cadd9c97add92a0276bf64e0da941cae63fd7c 
  src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d 


Diff: https://reviews.apache.org/r/66051/diff/15/

Changes: https://reviews.apache.org/r/66051/diff/14-15/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/
---

(Updated April 27, 2018, 1:08 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
---

use `Clock::settle()`.


Bugs: MESOS-8748
https://issues.apache.org/jira/browse/MESOS-8748


Repository: mesos


Description
---

Added test for authorization actions for `RESIZE_VOLUME`.


Diffs (updated)
-

  src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 


Diff: https://reviews.apache.org/r/66532/diff/6/

Changes: https://reviews.apache.org/r/66532/diff/5-6/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-04-27 Thread Zhitao Li


> On April 25, 2018, 9:09 a.m., Greg Mann wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 451 (patched)
> > <https://reviews.apache.org/r/66220/diff/6/?file=2006752#file2006752line451>
> >
> > It's not 100% clear to me if these tests actually verify that the 
> > agent's resource checkpointing has occurred? IIUC, settling the clock after 
> > awaiting on the `ApplyOperationMessage` which applies the GROW or SHRINK 
> > may be sufficient to verify this?

They don't. I've rewrittent the comment.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/#review201843
-----------


On April 27, 2018, 1:04 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66220/
> ---
> 
> (Updated April 27, 2018, 1:04 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.
> 
> 
> Bugs: MESOS-4965
> https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> 
> 
> Diffs
> -
> 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66220/diff/8/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-04-27 Thread Zhitao Li


> On April 25, 2018, 4:59 p.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 501-505 (patched)
> > <https://reviews.apache.org/r/66220/diff/7/?file=2011389#file2011389line501>
> >
> > ```
> > // Disk spaces will be merged if fixture parameter is `NONE`.
> > Bytes totalBytes = GetParam() == NONE ? MegaBytes(4096) : 
> > MegaBytes(2048);
> > ```
> > 
> > TBH I'm not a fan of conditioning on the fixture parameters. If a new 
> > fixture is created, this can be resolved by having only one disk in 
> > `getSlaveResources()` instead. But given the fact that we also do similar 
> > conditional branches in other tests, I can live with this.

I'll leave this as a TODO.


> On April 25, 2018, 4:59 p.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 823 (patched)
> > <https://reviews.apache.org/r/66220/diff/7/?file=2011389#file2011389line823>
> >
> > This is not required.

New test has offers so this will be necessary.


> On April 25, 2018, 4:59 p.m., Chun-Hung Hsiao wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 853-855 (patched)
> > <https://reviews.apache.org/r/66220/diff/7/?file=2011389#file2011389line853>
> >
> > This is essentially testing the same property as the last 
> > `EXPECT_FALSE` so we don't need it.

New test dropped this.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/#review201969
---


On April 27, 2018, 1:04 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66220/
> ---
> 
> (Updated April 27, 2018, 1:04 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.
> 
> 
> Bugs: MESOS-4965
> https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> 
> 
> Diffs
> -
> 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66220/diff/8/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-04-27 Thread Zhitao Li


> On April 13, 2018, 9:43 a.m., Greg Mann wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 455-459 (patched)
> > <https://reviews.apache.org/r/66220/diff/4/?file=1996762#file1996762line455>
> >
> > Is this enforced somewhere in validation code? Can we check for 
> > expected behavior when a GROW/SHRINK operation is submitted for a MOUNT 
> > volume, rather than simply returning?
> 
> Zhitao Li wrote:
> I added validation in r/66050 so we drop shrink volume operation on 
> MOUNT. There is no logical path for triggering `GROW` on `MOUNT` disk type so 
> I'm not writing validation.
> 
> For testing, we can either keep the current structure and check that 
> `GROW`/`SHRINK` do not work on `MOUNT` (operation will be dropped), or take 
> Chun's suggestion to not test them for `MOUNT`. Please let me know.
> 
> Zhitao Li wrote:
> I think this will be the only test case which needs to skip `MOUNT`. It 
> also makes some sense because there is no logical starting point for a 
> framework to even construct a `GROW_VOLUME` message for `MOUNT`.
> 
> I still feel that the handling here could be better. we can discuss on 
> next patch revision.
> 
> Chun-Hung Hsiao wrote:
> I'm sorry for missing this reply. If we want to just skip `MOUNT`, I 
> prefer having a new fixture as a subclass of the original and make it only 
> parameterized against `NONE` and `PATH`; otherwise, how about doing proper 
> checks in `offersAfterGrow` based on the parameter:
> ```
> if (GetParam() == MOUNT) {
>   EXPECT_EQ(
>   allocatedResources(Resources(volume), frameworkInfo.roles(0)),
>   Resources(offer.resources()).persistentVolumes());
> } else {
>   EXPECT_TRUE(os::exists(volumePath));
>   EXPECT_SOME_EQ("abc", os::read(filePath));
> 
>   EXPECT_EQ(
>   allocatedResources(Resources(grownVolume), frameworkInfo.roles(0)),
>   Resources(offer.resources()).persistentVolumes());
> 
>   EXPECT_FALSE(
>   Resources(offer.resources()).contains(
>   allocatedResources(addition, frameworkInfo.roles(0;
> }
> ```
> My reason is that when seeing 
> `ROOT_MountDiskResource/PersistentVolumeTest.GrowVolume/0` passes really 
> means something is tested.

I'm going to leave this as a TODO for future improvement.


> On April 13, 2018, 9:43 a.m., Greg Mann wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 541-542 (patched)
> > <https://reviews.apache.org/r/66220/diff/4/?file=1996762#file1996762line541>
> >
> > Is this `Future` necessary? Since the task consumes the volume, it may 
> > be sufficient to await on the task status updates?
> 
> Zhitao Li wrote:
> Yes this is necessary if we do not launch task in this test. we need to 
> reliably know that `Allocator::updateAllocation` is called from master, and 
> this message happens before that, so this `Future` ensures allocator has 
> properly processed all operation conversions and next offer will contain the 
> host and updated resources.
> 
> Chun-Hung Hsiao wrote:
> I see. Here you're relying on the fact that when the 
> `ApplyOperationMessage` is processed on the slave, the master has already 
> speculatively applied the operation called `updatedAllocation`. However 
> `updateAllocation` is asynchronous so there is still a potential race between 
> `ApplyOperationMessage` and the actual allocation update.
> 
> I suggest instead of waiting for the internal message, let's do a 
> `Clock::settle();` before `Clock::advance(...)`. I just tested it locally.
> 
> Ditto for all the following awaits for `ApplyOperationMessage`.

`Clock::settle()` is not sufficient. What we need here is not confirming on 
agent processing, but confirming master has sent this to agent, because we 
reliably know that master has also called `allocatior::updateAllocation()` 
(which happens before this call). Otherwise allocator may not have correct view 
of resources.

Unfornately master -> allocator communication cannot be intecepted by 
`FUTURE...` so this is the closest thing I can find.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/#review201109
---


On April 27, 2018, 1:04 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66220/
> ---
> 
> (Updated April 27, 2018, 1:04 p.m.

Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-04-27 Thread Zhitao Li


> On April 25, 2018, 9:09 a.m., Greg Mann wrote:
> > src/tests/persistent_volume_tests.cpp
> > Lines 585-586 (patched)
> > <https://reviews.apache.org/r/66220/diff/6/?file=2006752#file2006752line585>
> >
> > IIUC, the `Clock::settle()` here is ensuring that this test actually 
> > verifies that the agent's resources have been checkpointed?
> > 
> > If so, I would recommend:
> > 1) Add a comment explaining the reason for the settle.
> > 2) Place the settle after the await. I think the two orderings are 
> > equivalent, but I think it's more readable if the settle is after the await?
> 
> Chun-Hung Hsiao wrote:
> The await can be removed. See my comments above.

This is necessary to make sure that master -> allocator communication for 
`allocator::updateAllocation()` has happened.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/#review201843
-------


On April 27, 2018, 1:04 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66220/
> ---
> 
> (Updated April 27, 2018, 1:04 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.
> 
> 
> Bugs: MESOS-4965
> https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.
> 
> 
> Diffs
> -
> 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66220/diff/8/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66220: Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66220/
---

(Updated April 27, 2018, 1:04 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Bugs: MESOS-4965
https://issues.apache.org/jira/browse/MESOS-4965


Repository: mesos


Description
---

Added tests for `GROW_VOLUME` and `SHRINK_VOLUME` operations.


Diffs (updated)
-

  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 


Diff: https://reviews.apache.org/r/66220/diff/8/

Changes: https://reviews.apache.org/r/66220/diff/7-8/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66050: Implemented grow and shrink of persistent volumes.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66050/
---

(Updated April 27, 2018, 12:24 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Review comments with better validation on chaining.


Bugs: MESOS-4965
https://issues.apache.org/jira/browse/MESOS-4965


Repository: mesos


Description
---

The new offer operations are implemented as speculative operations, but
we will use validation to make them non-speculative on API level so that
we can transition later without a breaking change.


Diffs (updated)
-

  src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b 
  src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 
  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf 
  src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d 


Diff: https://reviews.apache.org/r/66050/diff/14/

Changes: https://reviews.apache.org/r/66050/diff/13-14/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66049: Added offer operation to grow and shrink persistent volumes.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66049/
---

(Updated April 27, 2018, 12:23 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Fix build when grpc is enabled.


Bugs: MESOS-4965
https://issues.apache.org/jira/browse/MESOS-4965


Repository: mesos


Description
---

Added offer operation to grow and shrink persistent volumes.


Diffs (updated)
-

  include/mesos/mesos.proto 5bc4a80ad5278d20feced73dce755519515de505 
  include/mesos/v1/mesos.proto 5a4e733250831fa5fe86544aeb98dbc0a4f5afa6 
  src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b 
  src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 
  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/resource_provider/storage/provider.cpp 
8ca2d3a98858940e6e027becefb53c2f00b4ae43 
  src/tests/mesos.hpp 756a52189ea3336b203ea1d606e2ba17762d57aa 
  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 
  src/tests/reservation_tests.cpp 5570df2e0b208512d0a0a3079a180e1acfe08f3d 


Diff: https://reviews.apache.org/r/66049/diff/12/

Changes: https://reviews.apache.org/r/66049/diff/11-12/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66284: Tested `max_completion_time` support in docker executor.

2018-04-27 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66284/
---

(Updated April 27, 2018, 9:09 a.m.)


Review request for mesos, Jason Lai and James Peach.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

Tested `max_completion_time` support in docker executor.


Diffs (updated)
-

  src/tests/containerizer/docker_containerizer_tests.cpp 
5e3dfdb537f52a85a82c6e8195e70a389b76aadf 


Diff: https://reviews.apache.org/r/66284/diff/4/

Changes: https://reviews.apache.org/r/66284/diff/3-4/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.

2018-04-25 Thread Zhitao Li


> On April 23, 2018, 3:43 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Line 4512 (original), 4515-4527 (patched)
> > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515>
> >
> > I was just wondering - perhaps this code belongs in the validation 
> > function for the ACCEPT call? Since this is something that we can easily 
> > confirm synchronously in the top-level scheduler API handler and return 
> > 400, rather than calling into the accept handlers.
> 
> Zhitao Li wrote:
> I'm fine in either case as long as it provides better consistency and 
> less code duplication (which I think is a problem). I guess the reason there 
> is a split is due to that AuthZ is async right now.
> 
> I actualy wonder what happens if we move the slave connectivity checks to 
> `Master::accept()`: since master to agent connection through network can be 
> broken at any moment, I do not see checking that in `_accept()` is 100% safe 
> anyway, and `send()` calls to agent need to handle failures anyway.
> 
> Alternatively, we can consider moving authorization checks as early as we 
> can, and condense all other validations after AuthZ returned (assuming AuthZ 
> do not depends on these validations).
> 
> Zhitao Li wrote:
> (replied too fast) I feel both work can reduce the code duplicate, and 
> makes it less ambiguous for future development.
> 
> Chun-Hung Hsiao wrote:
> IMO in general it's better to have validations before authN/authZ, 
> because validations is usually more light-weight than authN/authZ. Also, 
> validations should be synchronous, meaning that if the call is invalid, we 
> will get rid of it faster from the memory (rather than waiting for the 
> authN/authZ to finish).
> 
> Chun-Hung Hsiao wrote:
> But in this case since this is temporary, I'm fine with putting the logic 
> here.
> 
> Chun-Hung Hsiao wrote:
> Question: Are these operations usable with this restriction? It seems to 
> me that with this restriction, the framework will need to maintain one more 
> extra state: 1. reserve enough CPUs, mem and free disk (otherwise it is 
> possible that the framework cannot get enough CPUs and mem after growing the 
> volume); 2. grow the persistent volume; 3. use the grown volume. If we 
> implement a proper validation right now, 1 and 2 can be combined, and there 
> will be no need to rewrite the framework in the future. Can you explain again 
> what the concern of implementing the correct validation logic again?
> 
> Zhitao Li wrote:
> Re: order of valiation vs AuthZ: I agree they should be consistency, but 
> we already have splits between `accept()` and `_accept()` and framework could 
> see operations dropped either before or after AuthZ too. Given that 1) AuthZ 
> will be async by nature and 2) certain states could change between `accept()` 
> and `_accept()` (offer validaty, agent connectivity), I feel that only check 
> AuthZ on `accept()` but delay any actual validation to `_accept()` results in 
> most consistent behavior. Indeed this could increase calls to authorizor but 
> I don't know whether that can be quantified.
> 
> Re: usability of framework: I generally believe that framework always 
> needs to model each state to handle corner cases even if it uses chained 
> operations (because that can have partial success). That's also why I hope to 
> eventually not support chained operations at all. In the final form of the 
> code, whatever "correct validation logic" we write now will not be necessary, 
> and we can implicitly rely on `offeredResources` containment checks chained 
> operations (because `converted` resources will not be added back 
> immediately). Implementing this right now would require us to add extra 
> variables like `observedOfferedResources` and `actualOfferedResources` and 
> check them on each `case` of operations, which is not be as clean or as local 
> as current patch (which also easier to back out).
> 
> Chun-Hung Hsiao wrote:
> I'm not sure if I get it. Instead of applying the conversion on 
> `_offeredResources`, we could just do
> ```
> _offeredResources -= consumed;
> ```
> Wouldn't this solve the issue? Am I missing anything?
> 
> Zhitao Li wrote:
> This code block 
> (https://github.com/apache/mesos/blob/bd688e4cf9f6f35c9d75aad50b99e1fdeb8104da/src/master/master.cpp#L5502)
>  requires `_offeredResources` properly tracks what allocator needs to 
> recover. I believe we still need to handle `converted` part of `grow` and 
> `shrink`, otherwise allocator could see inconsitent views.
> 
> Chun-Hung Hsiao wrote:
> Thanks for point it out Zhitao.
> 
>

Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.

2018-04-25 Thread Zhitao Li


> On April 23, 2018, 3:43 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Line 4512 (original), 4515-4527 (patched)
> > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515>
> >
> > I was just wondering - perhaps this code belongs in the validation 
> > function for the ACCEPT call? Since this is something that we can easily 
> > confirm synchronously in the top-level scheduler API handler and return 
> > 400, rather than calling into the accept handlers.
> 
> Zhitao Li wrote:
> I'm fine in either case as long as it provides better consistency and 
> less code duplication (which I think is a problem). I guess the reason there 
> is a split is due to that AuthZ is async right now.
> 
> I actualy wonder what happens if we move the slave connectivity checks to 
> `Master::accept()`: since master to agent connection through network can be 
> broken at any moment, I do not see checking that in `_accept()` is 100% safe 
> anyway, and `send()` calls to agent need to handle failures anyway.
> 
> Alternatively, we can consider moving authorization checks as early as we 
> can, and condense all other validations after AuthZ returned (assuming AuthZ 
> do not depends on these validations).
> 
> Zhitao Li wrote:
> (replied too fast) I feel both work can reduce the code duplicate, and 
> makes it less ambiguous for future development.
> 
> Chun-Hung Hsiao wrote:
> IMO in general it's better to have validations before authN/authZ, 
> because validations is usually more light-weight than authN/authZ. Also, 
> validations should be synchronous, meaning that if the call is invalid, we 
> will get rid of it faster from the memory (rather than waiting for the 
> authN/authZ to finish).
> 
> Chun-Hung Hsiao wrote:
> But in this case since this is temporary, I'm fine with putting the logic 
> here.
> 
> Chun-Hung Hsiao wrote:
> Question: Are these operations usable with this restriction? It seems to 
> me that with this restriction, the framework will need to maintain one more 
> extra state: 1. reserve enough CPUs, mem and free disk (otherwise it is 
> possible that the framework cannot get enough CPUs and mem after growing the 
> volume); 2. grow the persistent volume; 3. use the grown volume. If we 
> implement a proper validation right now, 1 and 2 can be combined, and there 
> will be no need to rewrite the framework in the future. Can you explain again 
> what the concern of implementing the correct validation logic again?
> 
> Zhitao Li wrote:
> Re: order of valiation vs AuthZ: I agree they should be consistency, but 
> we already have splits between `accept()` and `_accept()` and framework could 
> see operations dropped either before or after AuthZ too. Given that 1) AuthZ 
> will be async by nature and 2) certain states could change between `accept()` 
> and `_accept()` (offer validaty, agent connectivity), I feel that only check 
> AuthZ on `accept()` but delay any actual validation to `_accept()` results in 
> most consistent behavior. Indeed this could increase calls to authorizor but 
> I don't know whether that can be quantified.
> 
> Re: usability of framework: I generally believe that framework always 
> needs to model each state to handle corner cases even if it uses chained 
> operations (because that can have partial success). That's also why I hope to 
> eventually not support chained operations at all. In the final form of the 
> code, whatever "correct validation logic" we write now will not be necessary, 
> and we can implicitly rely on `offeredResources` containment checks chained 
> operations (because `converted` resources will not be added back 
> immediately). Implementing this right now would require us to add extra 
> variables like `observedOfferedResources` and `actualOfferedResources` and 
> check them on each `case` of operations, which is not be as clean or as local 
> as current patch (which also easier to back out).
> 
> Chun-Hung Hsiao wrote:
> I'm not sure if I get it. Instead of applying the conversion on 
> `_offeredResources`, we could just do
> ```
> _offeredResources -= consumed;
> ```
> Wouldn't this solve the issue? Am I missing anything?

This code block 
(https://github.com/apache/mesos/blob/bd688e4cf9f6f35c9d75aad50b99e1fdeb8104da/src/master/master.cpp#L5502)
 requires `_offeredResources` properly tracks what allocator needs to recover. 
I believe we still need to handle `converted` part of `grow` and `shrink`, 
otherwise allocator could see inconsitent views.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://r

Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-04-25 Thread Zhitao Li


> On April 17, 2018, 4:24 p.m., Greg Mann wrote:
> > src/tests/authorization_tests.cpp
> > Line 1979 (original), 1979 (patched)
> > <https://reviews.apache.org/r/66532/diff/2/?file=1996775#file1996775line1979>
> >
> > Could you also add an end-to-end test of authorization for these 
> > operations?
> 
> Zhitao Li wrote:
> Sure. Do you have an example or existing test to add this?
> 
> Greg Mann wrote:
> There are some similar tests in both 'persistent_volume_tests.cpp' as 
> well as 'master_authorization_tests.cpp'.
> 
> Tests which do something very similar in the persistent volume tests are 
> 'PersistentVolumeTest.BadACLDropCreateAndDestroy' and 
> 'PersistentVolumeTest.GoodACLCreateThenDestroy'. It would be good to verify 
> both the successful and failed authorization cases.

I've added some test cases. It seems that there will be some level of code 
duplication and I'm not sure whether we want to follow the ones for 
`...CreateAndDrop`, or Chun's idea of smaller tests verifying one thing at a 
time. We can definitely iterate on these. Marking this comment as fixed.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/#review201362
-------


On April 24, 2018, 11:16 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66532/
> ---
> 
> (Updated April 24, 2018, 11:16 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-8748
> https://issues.apache.org/jira/browse/MESOS-8748
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added test for authorization actions for `RESIZE_VOLUME`.
> 
> 
> Diffs
> -
> 
>   src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
>   src/tests/persistent_volume_tests.cpp 
> 4edf781711d9efdb994114aeb6289b6af750b87a 
> 
> 
> Diff: https://reviews.apache.org/r/66532/diff/5/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



Re: Review Request 66532: Added test for authorization actions for `RESIZE_VOLUME`.

2018-04-25 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66532/
---

(Updated April 24, 2018, 11:16 p.m.)


Review request for mesos, Chun-Hung Hsiao and Greg Mann.


Changes
---

Add initial tests for authorization of new operations.


Bugs: MESOS-8748
https://issues.apache.org/jira/browse/MESOS-8748


Repository: mesos


Description
---

Added test for authorization actions for `RESIZE_VOLUME`.


Diffs (updated)
-

  src/tests/authorization_tests.cpp a76ad18a54ec232e6d6c92ec6a45b445a83f174c 
  src/tests/persistent_volume_tests.cpp 
4edf781711d9efdb994114aeb6289b6af750b87a 


Diff: https://reviews.apache.org/r/66532/diff/5/

Changes: https://reviews.apache.org/r/66532/diff/4-5/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.

2018-04-25 Thread Zhitao Li


> On April 23, 2018, 3:43 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Line 4512 (original), 4515-4527 (patched)
> > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515>
> >
> > I was just wondering - perhaps this code belongs in the validation 
> > function for the ACCEPT call? Since this is something that we can easily 
> > confirm synchronously in the top-level scheduler API handler and return 
> > 400, rather than calling into the accept handlers.
> 
> Zhitao Li wrote:
> I'm fine in either case as long as it provides better consistency and 
> less code duplication (which I think is a problem). I guess the reason there 
> is a split is due to that AuthZ is async right now.
> 
> I actualy wonder what happens if we move the slave connectivity checks to 
> `Master::accept()`: since master to agent connection through network can be 
> broken at any moment, I do not see checking that in `_accept()` is 100% safe 
> anyway, and `send()` calls to agent need to handle failures anyway.
> 
> Alternatively, we can consider moving authorization checks as early as we 
> can, and condense all other validations after AuthZ returned (assuming AuthZ 
> do not depends on these validations).
> 
> Zhitao Li wrote:
> (replied too fast) I feel both work can reduce the code duplicate, and 
> makes it less ambiguous for future development.
> 
> Chun-Hung Hsiao wrote:
> IMO in general it's better to have validations before authN/authZ, 
> because validations is usually more light-weight than authN/authZ. Also, 
> validations should be synchronous, meaning that if the call is invalid, we 
> will get rid of it faster from the memory (rather than waiting for the 
> authN/authZ to finish).
> 
> Chun-Hung Hsiao wrote:
> But in this case since this is temporary, I'm fine with putting the logic 
> here.
> 
> Chun-Hung Hsiao wrote:
> Question: Are these operations usable with this restriction? It seems to 
> me that with this restriction, the framework will need to maintain one more 
> extra state: 1. reserve enough CPUs, mem and free disk (otherwise it is 
> possible that the framework cannot get enough CPUs and mem after growing the 
> volume); 2. grow the persistent volume; 3. use the grown volume. If we 
> implement a proper validation right now, 1 and 2 can be combined, and there 
> will be no need to rewrite the framework in the future. Can you explain again 
> what the concern of implementing the correct validation logic again?

Re: order of valiation vs AuthZ: I agree they should be consistency, but we 
already have splits between `accept()` and `_accept()` and framework could see 
operations dropped either before or after AuthZ too. Given that 1) AuthZ will 
be async by nature and 2) certain states could change between `accept()` and 
`_accept()` (offer validaty, agent connectivity), I feel that only check AuthZ 
on `accept()` but delay any actual validation to `_accept()` results in most 
consistent behavior. Indeed this could increase calls to authorizor but I don't 
know whether that can be quantified.

Re: usability of framework: I generally believe that framework always needs to 
model each state to handle corner cases even if it uses chained operations 
(because that can have partial success). That's also why I hope to eventually 
not support chained operations at all. In the final form of the code, whatever 
"correct validation logic" we write now will not be necessary, and we can 
implicitly rely on `offeredResources` containment checks chained operations 
(because `converted` resources will not be added back immediately). 
Implementing this right now would require us to add extra variables like 
`observedOfferedResources` and `actualOfferedResources` and check them on each 
`case` of operations, which is not be as clean or as local as current patch 
(which also easier to back out).


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66568/#review201785
---


On April 23, 2018, 1:07 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66568/
> ---
> 
> (Updated April 23, 2018, 1:07 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-4965
> https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> These two operations are intended to be non-speculative eventuall

Re: Review Request 66050: Implemented grow and shrink of persistent volumes.

2018-04-24 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66050/
---

(Updated April 24, 2018, 2:16 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Rebase.


Bugs: MESOS-4965
https://issues.apache.org/jira/browse/MESOS-4965


Repository: mesos


Description
---

The new offer operations are implemented as speculative operations, but
we will use validation to make them non-speculative on API level so that
we can transition later without a breaking change.


Diffs (updated)
-

  src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b 
  src/common/resources_utils.cpp 9be01c1abd48264e308960f35cc7c2ee8a367518 
  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/master/validation.hpp 7c129ceb929596acbb64d37025e055661277e6bf 
  src/master/validation.cpp ac2e1bb8771841ec59b3bdcdeffb6c6230680d4d 


Diff: https://reviews.apache.org/r/66050/diff/13/

Changes: https://reviews.apache.org/r/66050/diff/12-13/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66733: Added a new `RESIZE_VOLUME` agent capability.

2018-04-24 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66733/
---

(Updated April 24, 2018, 2:16 p.m.)


Review request for mesos, Chun-Hung Hsiao, Gaston Kleiman, and Greg Mann.


Changes
---

Review comments.


Bugs: MESOS-4945
https://issues.apache.org/jira/browse/MESOS-4945


Repository: mesos


Description
---

This will be used as a feature flag to gate the new volume resize
feature. This feature will be turn on by default once released.


Diffs (updated)
-

  include/mesos/mesos.proto 5bc4a80ad5278d20feced73dce755519515de505 
  include/mesos/v1/mesos.proto 5a4e733250831fa5fe86544aeb98dbc0a4f5afa6 
  src/common/protobuf_utils.hpp ae060f3a7946ac5862faeca15330e9642f934137 
  src/common/protobuf_utils.cpp 78bffd8595f0e9f34e981548d8136ff94160573b 
  src/slave/constants.cpp 51de71bc45ff4fb862690efd9428b25d92055391 
  src/slave/flags.cpp 0ee4f65b14b578c8f8de60d42dff0ee438cdd8a8 
  src/tests/master_tests.cpp d5ce52c0eb1ba333d1b3dd35518545c13d828ce6 
  src/tests/slave_tests.cpp ae82438a4f2f79820fe6d476e25b4915aa5f212c 


Diff: https://reviews.apache.org/r/66733/diff/2/

Changes: https://reviews.apache.org/r/66733/diff/1-2/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66291: Added support to `max_completion_time` in default executor.

2018-04-23 Thread Zhitao Li

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66291/
---

(Updated April 23, 2018, 4:56 p.m.)


Review request for mesos, Jason Lai and James Peach.


Changes
---

Use `killedByCompletionTimeout` to indicate kill reason, and refactor `kill()` 
to reuse a grace period.


Bugs: MESOS-8725
https://issues.apache.org/jira/browse/MESOS-8725


Repository: mesos


Description
---

When a task group has multiple tasks:
- each task can have its own `max_completion_time`, or not have one;
- if a task succeeds before its `max_completion_time`, all other tasks
will keep running;
- if a task fails, all other tasks in the same group will fail (as
before);
- if a task does not succeed before its `max_completion_time`, it will
fail with `TASK_FAILED` and reason `REASON_MAX_COMPLETION_TIME_REACHED`,
while other tasks will be killed without above reason.


Diffs (updated)
-

  src/launcher/default_executor.cpp ea0d425bb60e970f209f411632e1d486c279259b 


Diff: https://reviews.apache.org/r/66291/diff/3/

Changes: https://reviews.apache.org/r/66291/diff/2-3/


Testing
---


Thanks,

Zhitao Li



Re: Review Request 66568: Dropped combined operations with GROW and SHRINK volumes.

2018-04-23 Thread Zhitao Li


> On April 23, 2018, 3:43 p.m., Greg Mann wrote:
> > src/master/master.cpp
> > Line 4512 (original), 4515-4527 (patched)
> > <https://reviews.apache.org/r/66568/diff/5/?file=2010510#file2010510line4515>
> >
> > I was just wondering - perhaps this code belongs in the validation 
> > function for the ACCEPT call? Since this is something that we can easily 
> > confirm synchronously in the top-level scheduler API handler and return 
> > 400, rather than calling into the accept handlers.
> 
> Zhitao Li wrote:
> I'm fine in either case as long as it provides better consistency and 
> less code duplication (which I think is a problem). I guess the reason there 
> is a split is due to that AuthZ is async right now.
> 
> I actualy wonder what happens if we move the slave connectivity checks to 
> `Master::accept()`: since master to agent connection through network can be 
> broken at any moment, I do not see checking that in `_accept()` is 100% safe 
> anyway, and `send()` calls to agent need to handle failures anyway.
> 
> Alternatively, we can consider moving authorization checks as early as we 
> can, and condense all other validations after AuthZ returned (assuming AuthZ 
> do not depends on these validations).

(replied too fast) I feel both work can reduce the code duplicate, and makes it 
less ambiguous for future development.


- Zhitao


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66568/#review201785
---


On April 23, 2018, 1:07 p.m., Zhitao Li wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66568/
> ---
> 
> (Updated April 23, 2018, 1:07 p.m.)
> 
> 
> Review request for mesos, Chun-Hung Hsiao and Greg Mann.
> 
> 
> Bugs: MESOS-4965
> https://issues.apache.org/jira/browse/MESOS-4965
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> These two operations are intended to be non-speculative eventually but
> are implemented as speculative right now. To avoid frameworks opt-in to
> dangerous behavior, we require that accept can only contain one
> `GROW_VOLUME` or `SHRINK_VOLUME` and no other operations.
> 
> This is implemented by reuse existing code which already drops `LAUNCH`
> or `LAUNCH_GROUP` with proper reason and message.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
> 
> 
> Diff: https://reviews.apache.org/r/66568/diff/5/
> 
> 
> Testing
> ---
> 
> Behavior tested in https://reviews.apache.org/r/66569/.
> 
> 
> Thanks,
> 
> Zhitao Li
> 
>



  1   2   3   4   5   6   7   8   9   10   >