from:"Jiang Yan Xu"

Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-22 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218775
---



As we are discussing on slack how to deal with the broken test and how to 
change it to be more meaningful I think it won't be unreasonable to just delete 
it for now. We can always add it in its new form back.

- Jiang Yan Xu


On Nov. 22, 2019, noon, Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71742/
> ---
> 
> (Updated Nov. 22, 2019, noon)
> 
> 
> Review request for mesos, Benjamin Mahler and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-10032
> https://issues.apache.org/jira/browse/MESOS-10032
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In the case agents lost ZooKeeper connections and resetting its
> master to none and beginning to dropping control messages from the
> master, agent should not respond pings from master.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 
>   src/tests/slave_recovery_tests.cpp 0efd3a6ac09ad06d9365b7bb2295157b5175e6b8 
> 
> 
> Diff: https://reviews.apache.org/r/71742/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-15 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218653
---

Chatted on https://mesos.slack.com/archives/C8PDVCDE3/p1573688798006800.

The approach LGTM. Let's fix the tests. :)

src/slave/slave.cpp
Lines 6415 (patched)
<https://reviews.apache.org/r/71742/#comment306509>

s/respond/respond to/

But the comment is basically reiterating the behavior which is pretty 
straightforward. We can add a bit of rationale for the behavior here: 

```
// Don't respond to pings if the agent cannot detect the master (e.g., due 
to failing to connect to ZK) because it stops responding to control messages 
such as run/kill tasks. It's better to have the master eventually mark the 
agent as partitioned after prolonged ZK disconnection than to silently drop 
messages.
```

src/slave/slave.cpp
Lines 6419 (patched)
<https://reviews.apache.org/r/71742/#comment306510>

"registered master" is not a valid concept here, let's say "because the 
master cannot be detected"?

- Jiang Yan Xu

On Nov. 13, 2019, 2:39 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71742/
> ---
> 
> (Updated Nov. 13, 2019, 2:39 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-10032
> https://issues.apache.org/jira/browse/MESOS-10032
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In the case agents lost ZooKeeper connections and resetting its
> master to none and beginning to dropping control messages from the
> master, agent should not respond pings from master.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 
> 
> 
> Diff: https://reviews.apache.org/r/71742/diff/2/
> 
> 
> Testing
> ---
> 
> ==] 2322 tests from 222 test cases ran. (1038166 ms total)
> [  PASSED  ] 2321 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery, where TypeParam = 
> mesos::internal::slave::MesosContainerizer
> 
> This failed test verifies that the agent responds to pings from the master 
> while the agent is performing recovery, this PR will break this scenario.
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 71193: Supported multiple quota paths in the `disk/xfs` isolator.

2019-08-05 Thread Jiang Yan Xu



> On Aug. 2, 2019, 3:17 p.m., Jiang Yan Xu wrote:
> > src/slave/containerizer/mesos/isolators/xfs/disk.cpp
> > Line 570 (original), 566 (patched)
> > <https://reviews.apache.org/r/71193/diff/3/?file=2158799#file2158799line570>
> >
> > You are not changing the logic here but could you remind me why this 
> > error doesn't fail the update?
> 
> James Peach wrote:
> I think that we are just trying to be robust against host problems. This 
> would only fail if there was a serious disk error, so we just try to maintain 
> our internal consistency.

If it's btween continuing without being able to enforce quota and failing task 
+ inititating clean up, I'd say the latter is better protecting consistency?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71193/#review217059
---


On Aug. 5, 2019, 2:17 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71193/
> ---
> 
> (Updated Aug. 5, 2019, 2:17 a.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9900
> https://issues.apache.org/jira/browse/MESOS-9900
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `disk/xfs` isolator assumed that there would only be a single
> directory path for each project quota. When we apply project quotas
> to the overlayfs upperdir, that won't be true any more, since the
> upperdir will come under the same quota as the task sandbox.
> 
> Update the quota reclamation tracking to keep a set of disk paths that
> the quota has been applied to, and only reclaim the project ID once all
> those paths have been removed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/xfs/disk.hpp 
> 94d44e7f39dc01037461015b499a1fc3169b24e8 
>   src/slave/containerizer/mesos/isolators/xfs/disk.cpp 
> 646330c65b24aa28801ec99d7909db08a3e05c79 
> 
> 
> Diff: https://reviews.apache.org/r/71193/diff/4/
> 
> 
> Testing
> ---
> 
> sudo make check (Frdora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 71194: Add `disk/xfs` isolator support for ephemeral volumes.

2019-08-02 Thread Jiang Yan Xu



> On Aug. 2, 2019, 4:03 p.m., Jiang Yan Xu wrote:
> > src/slave/containerizer/mesos/isolators/xfs/disk.cpp
> > Lines 424 (patched)
> > <https://reviews.apache.org/r/71194/diff/3/?file=2158813#file2158813line424>
> >
> > FWIW sandbox scanning predates the container run state, now it does 
> > look like at least sandboxes (or any paths that can be deduced from the 
> > container ID) would be covered given it's checkpointed pretty early: 
> > https://github.com/apache/mesos/blob/e1176c453d04a8ef8f53cf23928b5bbb09173d78/src/slave/containerizer/mesos/containerizer.cpp#L1547
> > 
> > but of course the emphemeral volumes are added later than that and 
> > there could be dirs with projectIDs set before `ephemeral_volumes` is 
> > persisted.

Sorry I meant to add that the provisioner dir for a container can also be 
[deduced](https://github.com/apache/mesos/blob/e1176c453d04a8ef8f53cf23928b5bbb09173d78/src/slave/containerizer/mesos/provisioner/paths.hpp#L38)
 from the containerID so you can just scan the subdirs of the containers 
recovered. 

Up to you.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71194/#review217058
---


On Aug. 1, 2019, 7:27 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71194/
> ---
> 
> (Updated Aug. 1, 2019, 7:27 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9900
> https://issues.apache.org/jira/browse/MESOS-9900
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add support for labeling ephemeral volumes with the sandbox XFS
> project ID. This makes changes to the container rootfs share the
> same disk quota as the sandbox.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/xfs/disk.cpp 
> 646330c65b24aa28801ec99d7909db08a3e05c79 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.hpp 
> 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.cpp 
> f2040cf36c601a13281a78ff844ebd41000a2d65 
> 
> 
> Diff: https://reviews.apache.org/r/71194/diff/3/
> 
> 
> Testing
> ---
> 
> sudo make check (Frdora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 71194: Add `disk/xfs` isolator support for ephemeral volumes.

2019-08-02 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71194/#review217058
---




src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 342 (patched)
<https://reviews.apache.org/r/71194/#comment304302>

Use `at()`?



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 346 (patched)
<https://reviews.apache.org/r/71194/#comment304307>

There are three experessions `foreachpair` so if we break them apart, put 
each one a line?



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 347 (patched)
<https://reviews.apache.org/r/71194/#comment304308>

s/container volumes/persistent volumes/?



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 423 (patched)
<https://reviews.apache.org/r/71194/#comment304310>

Do we need to single out the overlay backend? Seems like we can just scan 
all backends?



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 424 (patched)
<https://reviews.apache.org/r/71194/#comment304309>

FWIW sandbox scanning predates the container run state, now it does look 
like at least sandboxes (or any paths that can be deduced from the container 
ID) would be covered given it's checkpointed pretty early: 
https://github.com/apache/mesos/blob/e1176c453d04a8ef8f53cf23928b5bbb09173d78/src/slave/containerizer/mesos/containerizer.cpp#L1547

but of course the emphemeral volumes are added later than that and there 
could be dirs with projectIDs set before `ephemeral_volumes` is persisted.



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 517 (patched)
<https://reviews.apache.org/r/71194/#comment304311>

Same question: why don't we fail here?


- Jiang Yan Xu


On Aug. 1, 2019, 7:27 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71194/
> ---
> 
> (Updated Aug. 1, 2019, 7:27 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9900
> https://issues.apache.org/jira/browse/MESOS-9900
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Add support for labeling ephemeral volumes with the sandbox XFS
> project ID. This makes changes to the container rootfs share the
> same disk quota as the sandbox.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/xfs/disk.cpp 
> 646330c65b24aa28801ec99d7909db08a3e05c79 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.hpp 
> 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.cpp 
> f2040cf36c601a13281a78ff844ebd41000a2d65 
> 
> 
> Diff: https://reviews.apache.org/r/71194/diff/3/
> 
> 
> Testing
> ---
> 
> sudo make check (Frdora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 71192: Propagate ephemeral volume information from rootfs.

2019-08-02 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71192/#review217060
---


Ship it!




Ship It!

- Jiang Yan Xu


On July 31, 2019, 6:03 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71192/
> ---
> 
> (Updated July 31, 2019, 6:03 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9900
> https://issues.apache.org/jira/browse/MESOS-9900
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Propagate the overlayfs upperdir path to the containerization
> layer in a general form as a set of ephemeral volume paths.
> 
> 
> Diffs
> -
> 
>   include/mesos/slave/containerizer.proto 
> a60c96302a6cec90ecd0a0885b844fff8d37db71 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a01edc8793a2eaa655f1729a01a01f1f61fbf7cb 
>   src/slave/containerizer/mesos/provisioner/backend.hpp 
> 7257d3a962ecdf87fe9d52facbd6a2619311a018 
>   src/slave/containerizer/mesos/provisioner/backends/aufs.hpp 
> 2c2518775d2bcb3b1857775723828c4ac08111ca 
>   src/slave/containerizer/mesos/provisioner/backends/aufs.cpp 
> 2eba55228e7596c170dddea2a9930ba6cf73404f 
>   src/slave/containerizer/mesos/provisioner/backends/bind.hpp 
> c4a1d5f35710f6eab13938333ed4cb97d9ff2f6b 
>   src/slave/containerizer/mesos/provisioner/backends/bind.cpp 
> 4cc4c520aa6872c05f474bc7837a353b43983e0b 
>   src/slave/containerizer/mesos/provisioner/backends/copy.hpp 
> 5dc9a2fd38cdfafda02129b0d144eee3230a1ff2 
>   src/slave/containerizer/mesos/provisioner/backends/copy.cpp 
> 10516caab82d910944814b80e0a1d9aeba19007c 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.hpp 
> 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.cpp 
> f2040cf36c601a13281a78ff844ebd41000a2d65 
>   src/slave/containerizer/mesos/provisioner/provisioner.hpp 
> 7f84aa499b21140eb29ef7f81e2608743b12eb75 
>   src/slave/containerizer/mesos/provisioner/provisioner.cpp 
> a081fb059f9e0788c2153d1e04eda37fe514a1f7 
> 
> 
> Diff: https://reviews.apache.org/r/71192/diff/3/
> 
> 
> Testing
> ---
> 
> sudo make check (Frdora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 71193: Supported multiple quota paths in the `disk/xfs` isolator.

2019-08-02 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71193/#review217059
---


Fix it, then Ship it!





src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Line 570 (original), 566 (patched)
<https://reviews.apache.org/r/71193/#comment304303>

You are not changing the logic here but could you remind me why this error 
doesn't fail the update?



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 776 (patched)
<https://reviews.apache.org/r/71193/#comment304304>

For `hashmap` there's already a `contains` defined.



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 778-779 (patched)
<https://reviews.apache.org/r/71193/#comment304305>

You could've used a `else if` followed by an `else`  to avoid another level 
of nesting.



src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Line 807 (original)
<https://reviews.apache.org/r/71193/#comment304306>

We could move the logging about the `dir` up above `erase` if we still want 
to log it (maybe for `VLOG(1)`)?


- Jiang Yan Xu


On July 30, 2019, 12:52 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71193/
> ---
> 
> (Updated July 30, 2019, 12:52 a.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9900
> https://issues.apache.org/jira/browse/MESOS-9900
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `disk/xfs` isolator assumed that there would only be a single
> directory path for each project quota. When we apply project quotas
> to the overlayfs upperdir, that won't be true any more, since the
> upperdir will come under the same quota as the task sandbox.
> 
> Update the quota reclamation tracking to keep a set of disk paths that
> the quota has been applied to, and only reclaim the project ID once all
> those paths have been removed.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/xfs/disk.hpp 
> 94d44e7f39dc01037461015b499a1fc3169b24e8 
>   src/slave/containerizer/mesos/isolators/xfs/disk.cpp 
> 646330c65b24aa28801ec99d7909db08a3e05c79 
> 
> 
> Diff: https://reviews.apache.org/r/71193/diff/3/
> 
> 
> Testing
> ---
> 
> sudo make check (Frdora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 71192: Propagate ephemeral volume information from rootfs.

2019-08-02 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71192/#review217057
---




src/slave/containerizer/mesos/provisioner/backend.hpp
Lines 55 (patched)
<https://reviews.apache.org/r/71192/#comment304301>

The backend abstraction probably shouldn't care about quota so probably 
phrasing it as "additional paths the backend has created" or something?



src/slave/containerizer/mesos/provisioner/backends/overlay.cpp
Lines 243 (patched)
<https://reviews.apache.org/r/71192/#comment304300>

Nit: with initializers I don't think we wrap elements with spaces.


- Jiang Yan Xu


On July 31, 2019, 6:03 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71192/
> ---
> 
> (Updated July 31, 2019, 6:03 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9900
> https://issues.apache.org/jira/browse/MESOS-9900
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Propagate the overlayfs upperdir path to the containerization
> layer in a general form as a set of ephemeral volume paths.
> 
> 
> Diffs
> -
> 
>   include/mesos/slave/containerizer.proto 
> a60c96302a6cec90ecd0a0885b844fff8d37db71 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a01edc8793a2eaa655f1729a01a01f1f61fbf7cb 
>   src/slave/containerizer/mesos/provisioner/backend.hpp 
> 7257d3a962ecdf87fe9d52facbd6a2619311a018 
>   src/slave/containerizer/mesos/provisioner/backends/aufs.hpp 
> 2c2518775d2bcb3b1857775723828c4ac08111ca 
>   src/slave/containerizer/mesos/provisioner/backends/aufs.cpp 
> 2eba55228e7596c170dddea2a9930ba6cf73404f 
>   src/slave/containerizer/mesos/provisioner/backends/bind.hpp 
> c4a1d5f35710f6eab13938333ed4cb97d9ff2f6b 
>   src/slave/containerizer/mesos/provisioner/backends/bind.cpp 
> 4cc4c520aa6872c05f474bc7837a353b43983e0b 
>   src/slave/containerizer/mesos/provisioner/backends/copy.hpp 
> 5dc9a2fd38cdfafda02129b0d144eee3230a1ff2 
>   src/slave/containerizer/mesos/provisioner/backends/copy.cpp 
> 10516caab82d910944814b80e0a1d9aeba19007c 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.hpp 
> 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 
>   src/slave/containerizer/mesos/provisioner/backends/overlay.cpp 
> f2040cf36c601a13281a78ff844ebd41000a2d65 
>   src/slave/containerizer/mesos/provisioner/provisioner.hpp 
> 7f84aa499b21140eb29ef7f81e2608743b12eb75 
>   src/slave/containerizer/mesos/provisioner/provisioner.cpp 
> a081fb059f9e0788c2153d1e04eda37fe514a1f7 
> 
> 
> Diff: https://reviews.apache.org/r/71192/diff/3/
> 
> 
> Testing
> ---
> 
> sudo make check (Frdora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 71080: Master should store the list of completed framework ids for lifecycle.

2019-07-24 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71080/#review216836
---



LGTM as a work around, which we won't commit upstream.

We reached a decision in 
https://mesos.slack.com/archives/C8PDVCDE3/p1563816071006900 to not merge it 
upstream as we couldn't reach consensus on how the flag / semantics should be 
maintained.


include/mesos/allocator/allocator.hpp
Lines 66 (patched)
<https://reviews.apache.org/r/71080/#comment304084>

`maxCompletedFrameworks` is used for the metrics to keep for completed 
frameworks in the allocator.

`maxCompletedFramework_ids` (we should name it `maxCompletedFrameworkIDs` 
if it were useful) is not used and we don't need to add this field.



src/master/master.hpp
Lines 2210 (patched)
<https://reviews.apache.org/r/71080/#comment304085>

camelCasing to be consistent.



src/master/master.hpp
Lines 2215 (patched)
<https://reviews.apache.org/r/71080/#comment304086>

I would use a `bool` (1 byte) as the value.

And a comment that the value is not meaningful and we are just using the 
map to simulate a "bounded hashset".


- Jiang Yan Xu


On July 18, 2019, 8:10 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71080/
> ---
> 
> (Updated July 18, 2019, 8:10 p.m.)
> 
> 
> Review request for mesos, James Peach and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8636
> https://issues.apache.org/jira/browse/MESOS-8636
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> It should be separately from that for webUI and endpoints.
> Currently the master stores the history of completed frameworks in
> a map with the full historical data of the framework, it is
> prohibitively expensive to keep a long history; In order to reject
> frameworks from reregistering if they have previously marked as
> completed, we only need to persist the framework IDs and are able
> to keep long history.
> 
> 
> Diffs
> -
> 
>   docs/configuration/master.md c56ac8510ea968f9587e23e81ed310caa968ee9e 
>   docs/operator-http-api.md 2d4a9b66e20cf19eceec718b7de3d812ab285772 
>   include/mesos/allocator/allocator.hpp 
> 2bab53ab5fb25931a724c20a039e1301983ba574 
>   src/master/constants.hpp 26afa356b7844b4ec6c2caeef33bd39c51148d5f 
>   src/master/flags.hpp 78623d68bf428cd3f52684303d98a525d42eb878 
>   src/master/flags.cpp 74f4daadd48e8e691be43759b88dc8b3c2df489a 
>   src/master/http.cpp 765d5052ab6a8d731113f8922d20fb280b843003 
>   src/master/master.hpp ffa7423ba533725f7c1123d9aa507b1348e7f281 
>   src/master/master.cpp f1ca637b4cb0382caec53b5a81f6a4eb46f4dd2d 
> 
> 
> Diff: https://reviews.apache.org/r/71080/diff/7/
> 
> 
> Testing
> ---
> 
> make check
> 
> sudo GLOG_v=1 ./bin/mesos-tests.sh --verbose 
> --gtest_filter=MasterTest.MaxCompletedFrameworksFlag --gtest_break_on_failure 
> --gtest_repeat=1000
> [   OK ] MasterTest.MaxCompletedFrameworksFlag (230 ms)
> [--] 1 test from MasterTest (235 ms total)
> 
> [--] Global test environment tear-down
> [==] 1 test from 1 test case ran. (281 ms total)
> [  PASSED  ] 1 test.
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 70862: Update `EXPECT` to `ASSERT` in blkio tests.

2019-06-17 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70862/#review215928
---


Ship it!




Ship It!

- Jiang Yan Xu


On June 16, 2019, 7:13 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70862/
> ---
> 
> (Updated June 16, 2019, 7:13 p.m.)
> 
> 
> Review request for mesos, Gilbert Song and Qian Zhang.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Since the blkio cgroup tests immediately require the result of
> `blkio_statistics()`, we should test for the presence of statistics
> with `ASSERT_TRUE` rather than `EXPECT_TRUE`.
> 
> 
> Diffs
> -
> 
>   src/tests/containerizer/cgroups_isolator_tests.cpp 
> eef8eb28ea57ef0a2a3c626ac9aee202eb7231d9 
> 
> 
> Diff: https://reviews.apache.org/r/70862/diff/1/
> 
> 
> Testing
> ---
> 
> make check (Fedora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 70741: Adopted container file operations for secrets volumes.

2019-06-06 Thread Jiang Yan Xu



> On June 5, 2019, 12:16 a.m., Jiang Yan Xu wrote:
> > src/slave/containerizer/mesos/launch.cpp
> > Lines 516 (patched)
> > <https://reviews.apache.org/r/70741/diff/2/?file=2147412#file2147412line524>
> >
> > Is it more explicit if you just name the operation 
> > `ContainerFileOperation::MV` or `ContainerFileOperation::MOVE` and 
> > implement it as such (rename + fallback) either here or in stout since you 
> > are essentially implementing how we use a `mv` command?
> 
> James Peach wrote:
> `rename` vs. `move` seems pretty similar. To me, `rename` seemed clear so 
> I prefer it. I didn't add a stout implementation because I didn't have the 
> time to handle all the corner cases and write proper tests for a general 
> utility.

I think people tend to map them to linux syscall or commands so renaming 
doesn't equal potentially copying and deleting? 

Just my opinion. Still giving you the ship it.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70741/#review215696
---


On June 5, 2019, 1:44 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70741/
> ---
> 
> (Updated June 5, 2019, 1:44 a.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and 
> Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9799
> https://issues.apache.org/jira/browse/MESOS-9799
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Switched the `volumes/secrets` isolator from using container
> pre-exec commands to using file operations. This allows us to
> reduce the number of forked child processes, consolidate code
> and leaves the `network/port_mapping` isolator as the last
> consumer of pre-exec commands.
> 
> 
> Diffs
> -
> 
>   include/mesos/slave/containerizer.proto 
> b2e35cbf01caea6c1e4f45c7b7d833bc7f065099 
>   src/common/protobuf_utils.hpp 8aa9aff71cbbd8e98573d0095cfc007cdea163bc 
>   src/common/protobuf_utils.cpp 870859de93b44a1debc6786a562b4bc28955ddab 
>   src/slave/containerizer/mesos/isolators/volume/secret.cpp 
> 7a9bb82b23b35728408fb37bac53d79883c0a19f 
>   src/slave/containerizer/mesos/launch.cpp 
> 0c482f46a97063133edfe29ae3c6a2721d29f6c6 
> 
> 
> Diff: https://reviews.apache.org/r/70741/diff/4/
> 
> 
> Testing
> ---
> 
> sudo make check (Fedora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 70741: Adopted container file operations for secrets volumes.

2019-06-06 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70741/#review215714
---


Ship it!




Ship It!

- Jiang Yan Xu


On June 5, 2019, 1:44 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70741/
> ---
> 
> (Updated June 5, 2019, 1:44 a.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and 
> Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9799
> https://issues.apache.org/jira/browse/MESOS-9799
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Switched the `volumes/secrets` isolator from using container
> pre-exec commands to using file operations. This allows us to
> reduce the number of forked child processes, consolidate code
> and leaves the `network/port_mapping` isolator as the last
> consumer of pre-exec commands.
> 
> 
> Diffs
> -
> 
>   include/mesos/slave/containerizer.proto 
> b2e35cbf01caea6c1e4f45c7b7d833bc7f065099 
>   src/common/protobuf_utils.hpp 8aa9aff71cbbd8e98573d0095cfc007cdea163bc 
>   src/common/protobuf_utils.cpp 870859de93b44a1debc6786a562b4bc28955ddab 
>   src/slave/containerizer/mesos/isolators/volume/secret.cpp 
> 7a9bb82b23b35728408fb37bac53d79883c0a19f 
>   src/slave/containerizer/mesos/launch.cpp 
> 0c482f46a97063133edfe29ae3c6a2721d29f6c6 
> 
> 
> Diff: https://reviews.apache.org/r/70741/diff/4/
> 
> 
> Testing
> ---
> 
> sudo make check (Fedora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 70741: Adopted container file operations for secrets volumes.

2019-06-05 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70741/#review215696
---




src/common/protobuf_utils.hpp
Lines 100 (patched)
<https://reviews.apache.org/r/70741/#comment302508>

Why this? The paragraphs above are still about `validateProtobufUnion` 
right?



src/common/protobuf_utils.hpp
Lines 436 (patched)
<https://reviews.apache.org/r/70741/#comment302509>

One empty line between the last element and the ending `}`



src/slave/containerizer/mesos/isolators/volume/secret.cpp
Line 138 (original), 147 (patched)
<https://reviews.apache.org/r/70741/#comment302511>

Do you know why `launchInfo.mounts` was not used (and for the same reason 
it couldn't be used here)?



src/slave/containerizer/mesos/launch.cpp
Lines 516 (patched)
<https://reviews.apache.org/r/70741/#comment302510>

Is it more explicit if you just name the operation 
`ContainerFileOperation::MV` or `ContainerFileOperation::MOVE` and implement it 
as such (rename + fallback) either here or in stout since you are essentially 
implementing how we use a `mv` command?


- Jiang Yan Xu


On May 28, 2019, 9:29 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70741/
> ---
> 
> (Updated May 28, 2019, 9:29 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and 
> Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9799
> https://issues.apache.org/jira/browse/MESOS-9799
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Switched the `volumes/secrets` isolator from using container
> pre-exec commands to using file operations. This allows us to
> reduce the number of forked child processes, consolidate code
> and leaves the `network/port_mapping` isolator as the last
> consumer of pre-exec commands.
> 
> 
> Diffs
> -
> 
>   include/mesos/slave/containerizer.proto 
> b2e35cbf01caea6c1e4f45c7b7d833bc7f065099 
>   src/common/protobuf_utils.hpp 8aa9aff71cbbd8e98573d0095cfc007cdea163bc 
>   src/common/protobuf_utils.cpp 870859de93b44a1debc6786a562b4bc28955ddab 
>   src/slave/containerizer/mesos/isolators/volume/secret.cpp 
> 7a9bb82b23b35728408fb37bac53d79883c0a19f 
>   src/slave/containerizer/mesos/launch.cpp 
> 0c482f46a97063133edfe29ae3c6a2721d29f6c6 
> 
> 
> Diff: https://reviews.apache.org/r/70741/diff/2/
> 
> 
> Testing
> ---
> 
> sudo make check (Fedora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 70712: Added filesystem operations to the `ContainerLaunchInfo`.

2019-05-24 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70712/#review215515
---




include/mesos/slave/containerizer.proto
Lines 207 (patched)
<https://reviews.apache.org/r/70712/#comment302249>

Nit: not all possible file operation have a source and a target? Would 
something like 

```
message ContainerFileOperation {
...
  message Symlink {
optional string target = 1;
  }
  
optional string path = 2;
```

make more sense?


- Jiang Yan Xu


On May 23, 2019, 11:46 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70712/
> ---
> 
> (Updated May 23, 2019, 11:46 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and 
> Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9769
> https://issues.apache.org/jira/browse/MESOS-9769
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `filesystem/linux` isolator was using pre-exec commands
> to set up Linux ABI symlinks. Not only is this inefficient,
> it has the undesirable security property of running programs
> in a user-controlled container image.
> 
> The fix added a new `ContainerFileOperation` message to the
> containerizer launch information. The containerizer executes
> the requested file operation after performing the container
> mounts.
> 
> 
> Diffs
> -
> 
>   include/mesos/slave/containerizer.proto 
> e9924489000efabebd55bf070f18149f23e4a510 
>   src/common/protobuf_utils.hpp 273ae270695db33b6c9d8b32cb38f8840a815787 
>   src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def 
>   src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp 
> 8f94453a7354927ae918d3f2fd746cdf5ef63cb7 
>   src/slave/containerizer/mesos/isolators/filesystem/linux.cpp 
> 190054c26b949aa9ba0f49377b77d9e472edb95a 
>   src/slave/containerizer/mesos/launch.cpp 
> 5ddb4c7d998c17b59164825acc0627a1311b691b 
> 
> 
> Diff: https://reviews.apache.org/r/70712/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check (Fedora 30)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 70660: Fix the XFS build for recent Fedora versions.

2019-05-17 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/70660/#review215327
---


Ship it!




Ship It!

- Jiang Yan Xu


On May 16, 2019, 7:24 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/70660/
> ---
> 
> (Updated May 16, 2019, 7:24 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> On Fedora 30, we need to include  to get the right
> types, since  is no longer self-contained. For
> earlier versions, it is still safe to include both headers if
> they are available, though h> may need to be
> obtained via the xfsprogs-qa-devel package.
> 
> 
> Diffs
> -
> 
>   configure.ac b4bad5716986e2f7c132c6515179a65ccbfdaeac 
>   src/slave/containerizer/mesos/isolators/xfs/utils.hpp 
> db1829abbaac6113d39e71673403afe75b5ee738 
> 
> 
> Diff: https://reviews.apache.org/r/70660/diff/1/
> 
> 
> Testing
> ---
> 
> make check (Fedota 30)
> make (CentOS 6)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 69615: Disable containerizer ptrace attach.

2019-02-13 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69615/#review212822
---




src/slave/containerizer/mesos/launch.cpp
Lines 556 (patched)
<https://reviews.apache.org/r/69615/#comment298707>

The main thing I wanted to confirm is what we chatted about offline but I 
didn't write down: the interpretation of this paragraph in 
http://man7.org/linux/man-pages/man2/prctl.2.html

```
PR_SET_DUMPABLE
...
  Normally, this flag is set to 1.  However, it is reset to the
  current value contained in the file /proc/sys/fs/suid_dumpable
  (which by default has the value 0), in the following
  circumstances:

  *  The process's effective user or group ID is changed.
...
```

This reads to me like, in a typical setup which runs Mesos agent as uid 0 
and the task with another uid (hence the call `os::setuid(uid.get());` below 
does change uid):

- If `/proc/sys/fs/suid_dumpable` is the default 0, then regardless of 
`--allow_ptrace` the result is that the launcher process becomes undumpable.
- If `/proc/sys/fs/suid_dumpable` is 1, then regardless of `--allow_ptrace` 
the result is that the launcher process becomes dumpable.

i.e., the `prctl()` call here is a noop.

Maybe I misunderstood the doc but perhaps we can tweak the test to be a 
`ROOT_*` test and change to a different task user in order to verify the 
behavior? If you have tested it manually that's fine too just wanted to double 
check :) 

FWIW, I checked a sample container that we run (without this patch) and see 
that

```
# ls -l /proc/1/
```

shows the files under it are all owned by root, which does appear to mean 
that the process' dumpable flag is not 1 according to 
http://man7.org/linux/man-pages/man5/proc.5.html?


- Jiang Yan Xu


On Feb. 8, 2019, 1:09 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69615/
> ---
> 
> (Updated Feb. 8, 2019, 1:09 p.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9349
> https://issues.apache.org/jira/browse/MESOS-9349
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Use `prctl(PR_SET_DUMPABLE)` to disable the ability to attach to
> the containerizer process(es) on Linux systems. This prevents
> unprivileged containerized processes from reading information
> about the containerizer process(es) from `/proc`. This gives an
> additional layer of protection against leaking information to
> untrusted container processes.
> 
> 
> Diffs
> -
> 
>   docs/configuration/agent.md a4015c409d00a4c117df6c869d5ba5072bcfe58e 
>   src/launcher/default_executor.cpp 5837cfa4deba557cae43112092ff24b97137951f 
>   src/launcher/executor.cpp f962e800f23d5582b1bc04a263253893492a5054 
>   src/slave/containerizer/mesos/containerizer.cpp 
> 35f51ad33da53b3e6a8eec275fbf3e77782b0fba 
>   src/slave/containerizer/mesos/launch.hpp 
> 0a6394d56321948ad760ac69c05456319a254842 
>   src/slave/containerizer/mesos/launch.cpp 
> 7f401cdf481123b8c6cc500ac02bb7daf2613d2c 
>   src/slave/flags.hpp 7346ba5b711a8353a4bc1d7dcd2f6184b777ddd0 
>   src/slave/flags.cpp 066b84f528b4c888dde399e0b5d5fe5531934de6 
>   src/slave/slave.cpp 0182dd2ca326723e96eef8c072696ad3c873de0b 
>   src/tests/containerizer/mesos_containerizer_tests.cpp 
> 449928c10b897061642af8ad267f8b70695940e6 
>   src/tests/slave_tests.cpp 22a0295086ae4f4ec26df00a0e077eecfa27f1fb 
> 
> 
> Diff: https://reviews.apache.org/r/69615/diff/3/
> 
> 
> Testing
> ---
> 
> make check (Fedora 29)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 69615: Disable containerizer ptrace attach.

2019-01-18 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69615/#review211839
---



For consistency if we updated command excutor we should update the default 
executor?


src/slave/containerizer/mesos/launch.cpp
Lines 140 (patched)
<https://reviews.apache.org/r/69615/#comment297764>

s/to to/to/



src/slave/containerizer/mesos/launch.cpp
Lines 505 (patched)
<https://reviews.apache.org/r/69615/#comment297763>

I haven't tried but I trust that you've verified that `PR_SET_DUMPABLE` is 
not persisted across fork but could you clarfiy via a comment? It's not clear 
from http://man7.org/linux/man-pages/man2/prctl.2.html.



src/slave/flags.cpp
Lines 338 (patched)
<https://reviews.apache.org/r/69615/#comment297762>

If disallow is the default, it's a breaking change right? Can the default 
be allowing it?


- Jiang Yan Xu


On Jan. 2, 2019, 9:15 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69615/
> ---
> 
> (Updated Jan. 2, 2019, 9:15 a.m.)
> 
> 
> Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9349
> https://issues.apache.org/jira/browse/MESOS-9349
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Use `prctl(PR_SET_DUMPABLE)` to disable the ability to attach to
> the containerizer process(es) on Linux systems. This prevents
> unprivileged containerized processes from reading information
> about the containerizer process(es) from `/proc`. This gives an
> additional layer of protection against leaking information to
> untrusted container processes.
> 
> 
> Diffs
> -
> 
>   docs/configuration/agent.md 330283f4e3957075dd4310de4a841feac23de36c 
>   src/launcher/executor.cpp f962e800f23d5582b1bc04a263253893492a5054 
>   src/slave/containerizer/mesos/containerizer.cpp 
> a5cf2da55c046c5c45e0c2ca3400f64de12de62b 
>   src/slave/containerizer/mesos/launch.hpp 
> 0a6394d56321948ad760ac69c05456319a254842 
>   src/slave/containerizer/mesos/launch.cpp 
> 2f1c9e7a8748c9d7eab25bc8567ca68308e680f9 
>   src/slave/flags.hpp 494ae02ab5eb365e2cda5017be573691107c3f28 
>   src/slave/flags.cpp 6bac8e1409f04d639204c45eda8a90c098e3dbd0 
>   src/slave/slave.cpp ad3b693a716cf6103345a157bf28dd60a7b07d32 
>   src/tests/containerizer/mesos_containerizer_tests.cpp 
> 449928c10b897061642af8ad267f8b70695940e6 
>   src/tests/slave_tests.cpp 4aed5d68e9a408821880ffaede482937be1999f4 
> 
> 
> Diff: https://reviews.apache.org/r/69615/diff/2/
> 
> 
> Testing
> ---
> 
> make check (Fedora 29)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 69373: Replaced a log consensus `CHECK()` with CHECK_GE()`.

2018-11-16 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69373/#review210613
---


Ship it!




Ship It!

- Jiang Yan Xu


On Nov. 16, 2018, 11:25 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69373/
> ---
> 
> (Updated Nov. 16, 2018, 11:25 a.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In scale testing, this `CHECK` failed with the following message:
> 
>   F1116 17:50:04.868387 53766 consensus.cpp:771] Check failed:
> highestNackProposal >= proposal
> 
> Emitting the values for `highestNackProposal` and `proposal` may
> help in debugging this failure.
> 
> 
> Diffs
> -
> 
>   src/log/consensus.cpp bce7facb0fe0d89ee96951bcf144332e41eb683e 
> 
> 
> Diff: https://reviews.apache.org/r/69373/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 68706: Added master failover reregistration progress metrics.

2018-11-02 Thread Jiang Yan Xu



> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> > If we use equality operator and only set the timer when such a number 
> > of reregistered agent is reached we are guaranteed to only set each timer 
> > once (but we may need to set multiple timers in one call) right? This 
> > alleviates the need to check if the timer is already set!
> > 
> > This should also work with boundary cases like the total recovered 
> > agents count being 0 or 1 (overlapping percentiles) etc. Right?
> 
> Jiang Yan Xu wrote:
> We should comment on the reasons for dropping issues.
> 
> Xudong Ni wrote:
> This is actaully fixed and marked as dropped somehow. We used equality 
> operator to 0.0 to check whether the percentile was reached or not; The 
> reason we used push gauge not timer is explained in push gauge vs timer 
> comments section

The point here is "This alleviates the need to check if the timer is already 
set!". It's still in the new revision. I made another comment about it.


> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> > On using Timer (e.g., like 
> > [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172))
> >  vs. PushGauge, after looking at how it'll be used I think the main 
> > advantage of Timer is that it doesn't export any value if you haven't set 
> > it.
> > 
> > Consider the two cases:
> > 
> > 1. There are 1000 recovered agents and 0 have reregsitered, should all 
> > the timers have zero values or should they be absent?
> > 
> > 2. There are 0 recovered agents (e.g., brand new cluster), should all 
> > of the metrics be zero or non-existent? I feel like they should be zero, as 
> > in, e.g., 100% of all 0 agents are reregistered within 0 secs.
> > 
> > So timer handles this natrually. Also it sets the `_secs` name for you 
> > but that's a minor conveninence.
> 
> Jiang Yan Xu wrote:
> We should comment on the reasons for dropping issues.
> 
> Xudong Ni wrote:
> Sorry about this, I did make the comments but it must be in one of draft 
> which was not saved. 
> 
> I agree that metrics should be zero but not absent when certain 
> percentige were not reached. I did tried both PushGauge and Timer 
> implementation and tested in our clusters.
> 
> If we used the timer, when the value was not set, that metric is actually 
> missing. PushGauge is set with the initial value 0.0 and we can tell whether 
> it's set yet, the metric will always exist no matter that percentile reached 
> or not, and it has better performance.
> 
> The brand new cluster case was tested and the results were included in 
> the test results.

My point was the opposite. When a percentage is not reached, I feel that the 
value being absent is preferrable and I used the above two cases to demonstrate 
it. (There is a subtle difference between the two IMO).

It's in fact the current practice used by metrics like `state_fetch`.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
---


On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote:
> 
> -----------
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> ---
> 
> (Updated Oct. 19, 2018, 4:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
> https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery

Re: Review Request 68706: Added master failover reregistration progress metrics.

2018-11-02 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review210145
---

There were changes added for a test but it's gone in the latest revision?

src/master/master.cpp
Lines 1720 (patched)
<https://reviews.apache.org/r/68706/#comment294924>

You can construct the entire `Option 
agentReregistrations` struct here using the `expectedAgentCount` as well as the 
`electedTime` which is a class member (Rationale explained below).

Initializing `agentReregistrations` is just one step in the list of steps 
taken in `Master::_recover` so you don't have to squeeze it between the two 
lines for allocator recover code. It's not appropriate as the block comment 
above clearly tries to explain the allocator reocvery. 

You can just put it somewhere below by itself (blank lines above and below 
the statement). The `expectedAgentCount` variable is already set and you can 
use; an frankly you can just use `registry.slaves().slaves().size()` as well.

src/master/master.cpp
Lines 2099 (patched)
<https://reviews.apache.org/r/68706/#comment294925>

- End sentences with `.`
- s/master/the master/

src/master/metrics.hpp
Lines 87 (patched)
<https://reviews.apache.org/r/68706/#comment294779>

- End sentences with `.`.
- s/when/during/

src/master/metrics.hpp
Lines 90 (patched)
<https://reviews.apache.org/r/68706/#comment294778>

No need for `explicit` as it's only for single argument constructors, you 
don't need it either with zero or two arguments (one proposal below).

src/master/metrics.hpp
Lines 90-123 (patched)
<https://reviews.apache.org/r/68706/#comment294927>

To be consistent let's put the implementation in the cpp file 
master/metrics.cpp. Here and for other methods as well.

src/master/metrics.hpp
Lines 92 (patched)
<https://reviews.apache.org/r/68706/#comment294926>

Indent two more spaces, we can see examples in master/metrics.cpp, here and 
elsewhere.

src/master/metrics.hpp
Lines 130 (patched)
<https://reviews.apache.org/r/68706/#comment294928>

We generally don't use one letter variable names. `t` here seemingly refers 
to `Time` but it's a `Duration`. It won't hurt to keep it strongly types until 
we need to convert it, i.e., Use `Duration duration = Clock::now() - 
electedTime;` should be fine.

src/master/metrics.hpp
Lines 132 (patched)
<https://reviews.apache.org/r/68706/#comment294929>

A space between `if` and `(`. Here and elsewhere.

src/master/metrics.hpp
Lines 133 (patched)
<https://reviews.apache.org/r/68706/#comment294931>

Let's `#include ` for ceil, in general include all needed headers 
without relying on transitive dependencies is a good pratice.

src/master/metrics.hpp
Lines 138 (patched)
<https://reviews.apache.org/r/68706/#comment294932>

As I commented in the previous revision, we don't need to check if a timer 
is already set. A percentage timer corresponds to one precise 
`reregisteredAgentCount` (and not vice versa) so we just need to check that. 
Does it make sense?

src/master/metrics.hpp
Lines 176-177 (patched)
<https://reviews.apache.org/r/68706/#comment294923>

Semantically it's hard to explain why `recoveredAgentCount` is an `Option` 
while `reregisteredAgentCount` is not? Why not have all three fields as 
Options? But then one may ask why not have the entire struct by an Option?

It's true that we don't have a `recoveredAgentCount` before they are 
recovered and `electedTime` before the master is elected but we can just use 
`Option agentReregistrations` like I suggested?

The semantics of `Option agentReregistrations` seems 
natural to me: it is optional because agent reregistrations only happen after 
the master is recovered.

- Jiang Yan Xu

On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> ---
> 
> (Updated Oct. 19, 2018, 4:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
> https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,

Re: Review Request 69233: Fixed a GetContainers crash when using XFS disk isolation.

2018-11-01 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/69233/#review210261
---


Ship it!





src/slave/containerizer/mesos/isolators/xfs/disk.cpp
Lines 660 (patched)
<https://reviews.apache.org/r/69233/#comment294910>

Might be worth adding a comment about root disks not having `Source` since 
people often forget. :)


- Jiang Yan Xu


On Nov. 1, 2018, 1:33 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/69233/
> ---
> 
> (Updated Nov. 1, 2018, 1:33 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9367
> https://issues.apache.org/jira/browse/MESOS-9367
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Since the XFS disk isolator didn't check that the persistent volume
> has a `DiskInfo::Source` before populating the corresponding field
> in the `ResourceStatistics` message. This caused the JSON conversion
> (triggered by operator API calls) to check fail. The fix is to simply
> check that the volume has a `DiskInfo::Source` before propagating it.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/xfs/disk.cpp 
> 44b7d0570b00d50e1fc454100ab73f4dfe5ca159 
>   src/tests/containerizer/xfs_quota_tests.cpp 
> b2f977f15e2ea8cbe5f1d60e046a8f7d3560 
> 
> 
> Diff: https://reviews.apache.org/r/69233/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check (Fedora 28)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 68706: Added master failover reregistration progress metrics.

2018-10-29 Thread Jiang Yan Xu



> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 1874-1875 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874>
> >
> > If we use equality operator and only set the timer when such a number 
> > of reregistered agent is reached we are guaranteed to only set each timer 
> > once (but we may need to set multiple timers in one call) right? This 
> > alleviates the need to check if the timer is already set!
> > 
> > This should also work with boundary cases like the total recovered 
> > agents count being 0 or 1 (overlapping percentiles) etc. Right?

We should comment on the reasons for dropping issues.


> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> > src/master/metrics.hpp
> > Lines 51 (patched)
> > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51>
> >
> > On using Timer (e.g., like 
> > [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172))
> >  vs. PushGauge, after looking at how it'll be used I think the main 
> > advantage of Timer is that it doesn't export any value if you haven't set 
> > it.
> > 
> > Consider the two cases:
> > 
> > 1. There are 1000 recovered agents and 0 have reregsitered, should all 
> > the timers have zero values or should they be absent?
> > 
> > 2. There are 0 recovered agents (e.g., brand new cluster), should all 
> > of the metrics be zero or non-existent? I feel like they should be zero, as 
> > in, e.g., 100% of all 0 agents are reregistered within 0 secs.
> > 
> > So timer handles this natrually. Also it sets the `_secs` name for you 
> > but that's a minor conveninence.

We should comment on the reasons for dropping issues.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
---


On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> ---
> 
> (Updated Oct. 19, 2018, 4:56 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
> https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/7/
> 
> 
> Testing
> ---
> 
> Automation:
> [ RUN  ] MasterTest.MetricsInMetricsEndpoint
> [   OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> 
> While the master is not elected or there is no agents recovered yet
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 0.0,
> "master/recovered_agents_50_percent_reregistered_secs": 0.0,
> "master/recovered_agents_75_percent_reregistered_secs": 0.0,
> "master/recovered_agents_90_percent_reregistered_secs": 0.0,
> "master/recovered_agents_99_percent_reregistered_secs": 0.0,
> "master/slave_reregistrations": 0.0,
> 
> While reregistrations is in progress: 5 out of 6 completed:
> "master/recovered_agents_100_percent_reregistered_secs": 0.0,
> "master/recovered_agents_25_percent_reregistered_secs": 2.0,
> "master/recovered_agents_50_percent_reregistered_secs": 3.0,
> "master/recovered_agents_75_percent_reregistered_secs": 6.0,
> "master/recovere

Re: Review Request 68706: Added master failover reregistration progress metrics.

2018-10-18 Thread Jiang Yan Xu



> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote:
> >

Sorry for the delay. Feel free to chat if it's not clear!


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
---


On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> ---
> 
> (Updated Oct. 16, 2018, 9:49 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
> https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable
> agents in the failover, if certain percentile recovery couldn't be
> reached, the intiail value of that percentile will not be updated.
> 
> 
> Diffs
> -
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 
>   src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
>   src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/5/
> 
> 
> Testing
> ---
> 
> Automation:
> [ RUN  ] MasterTest.MetricsInMetricsEndpoint
> [   OK ] MasterTest.MetricsInMetricsEndpoint (42 ms)
> 
> Real world cases:
> While reregistrations is in progress: 3277 out of 3582 completed:
> "master/slave_reregistrations": 3277.0,
> "master/slaves_100_percent_reregistered_secs": 0.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 0.0,
> 
> 
> While 3582 reregistrations were all completed:
> "master/slave_reregistrations": 3582.0,
> "master/slaves_100_percent_reregistered_secs": 54.0,
> "master/slaves_25_percent_reregistered_secs": 5.0,
> "master/slaves_50_percent_reregistered_secs": 11.0,
> "master/slaves_75_percent_reregistered_secs": 20.0,
> "master/slaves_90_percent_reregistered_secs": 30.0,
> "master/slaves_99_percent_reregistered_secs": 39.0,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 68706: Added master failover reregistration progress metrics.

2018-10-18 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review209722
---

src/master/master.hpp
Lines 2346-2347 (patched)
<https://reviews.apache.org/r/68706/#comment294277>

I wasn't sure if we needed a dedicated struct / method for these metrics 
when we first discussed this approach but looking at the the additional method, 
math and class member state that we need to track I feel that we do. 

It's probably the best if we put such logic in the master `Metrics` class. 
It'll isolate the metric calculation logic from the master and you can use the 
struct's member state to help with tracking.

Similar examples already exist here. e.g., 
https://github.com/apache/mesos/blob/f03f91545fec64a767f808c48a3b352dc86b1ac0/src/master/metrics.hpp#L89

I propose that we add something like this

```
struct AgentReregistrations
{
  AgentReregistrations(int recoveredAgentCount);
  ~AgentReregistrations();

  // If you use `Timer` you don't even need this. See comment about Timer 
below.
  process::Time electedTime, 
  int recoveredAgentCount;
  int reregisteredAgentCount = 0;
  // ... Whatever else state you may need.

  process::metrics::PushGauge recovered_agents_25_percent_reregistered_secs;
  // ... other gauges.

  void incrementAgentReregistered();
};

// Option since you only set it after the master is elected.
Option agentReregistrations;
```

src/master/master.cpp
Lines 1866-1868 (patched)
<https://reviews.apache.org/r/68706/#comment294282>

But we naturally don't need float comparison, we are comparing the 
reregistered agents count (int) against the **minimum** number of agents that 
**satisfy** the percentage requirement (int). So if you take a `ceil` of the 
floating point multiplication result and cast to int you'll get the 
semantically correct number?

e.g., if total = 5 and we are calculating 75% the `reregisteredAgentCount` 
we are looking for is 4.

src/master/master.cpp
Lines 1874-1875 (patched)
<https://reviews.apache.org/r/68706/#comment294312>

If we use equality operator and only set the timer when such a number of 
reregistered agent is reached we are guaranteed to only set each timer once 
(but we may need to set multiple timers in one call) right? This alleviates the 
need to check if the timer is already set!

This should also work with boundary cases like the total recovered agents 
count being 0 or 1 (overlapping percentiles) etc. Right?

src/master/metrics.hpp
Lines 50 (patched)
<https://reviews.apache.org/r/68706/#comment294275>

Can we use the names agreed-upon with BenM in the slack channel which is 
e.g., `recovered_agents_50_percent_reregistered_secs`?

src/master/metrics.hpp
Lines 51 (patched)
<https://reviews.apache.org/r/68706/#comment294313>

On using Timer (e.g., like 
[state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172))
 vs. PushGauge, after looking at how it'll be used I think the main advantage 
of Timer is that it doesn't export any value if you haven't set it.

Consider the two cases:

1. There are 1000 recovered agents and 0 have reregsitered, should all the 
timers have zero values or should they be absent?

2. There are 0 recovered agents (e.g., brand new cluster), should all of 
the metrics be zero or non-existent? I feel like they should be zero, as in, 
e.g., 100% of all 0 agents are reregistered within 0 secs.

So timer handles this natrually. Also it sets the `_secs` name for you but 
that's a minor conveninence.

- Jiang Yan Xu

On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> ---
> 
> (Updated Oct. 16, 2018, 9:49 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
> https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> During the master failover, the time that the master elected is
> considered as the start of failover. In the progress of
> reregistration, the percentile represents the time when such
> percentile of agents finished registration again; The percentile of
> these data as in this metrics can represent overall reregistration
> progress; In case of degradation towards to the end of reregistration,
> the high percentile can reflect it; In the case there are unreachable

Re: Review Request 68586: Add the output file to the hash on CommandInfo::URI.

2018-10-17 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68586/#review209710
---




include/mesos/type_utils.hpp
Lines 532 (patched)
<https://reviews.apache.org/r/68586/#comment294257>

So `cache` is counted here. Does it mean that if I specify two duplicate 
URIs with different `cache` values one will overwrite another in fetching?


- Jiang Yan Xu


On Aug. 31, 2018, 9:47 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68586/
> ---
> 
> (Updated Aug. 31, 2018, 9:47 a.m.)
> 
> 
> Review request for mesos, Gilbert Song and Joseph Wu.
> 
> 
> Bugs: MESOS-9172
> https://issues.apache.org/jira/browse/MESOS-9172
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `output_file` field of the `CommandInfo::URI` is not combined
> into the hash value. This means that the fetcher does not consider two
> messages that differ only by the `output_file` as different. By adding
> the `output_file` to the hash it is possible for a task to fetch the
> same URI to multiple, distinct ouptuts.
> 
> 
> Diffs
> -
> 
>   include/mesos/type_utils.hpp 19ea81716496bcc0117a1b0ff157a0374f38bbfa 
>   include/mesos/v1/mesos.hpp fda3eb42061f820869a2d8da939fccadc4e5ddfb 
> 
> 
> Diff: https://reviews.apache.org/r/68586/diff/1/
> 
> 
> Testing
> ---
> 
> make check (Fedora 28)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 68586: Add the output file to the hash on CommandInfo::URI.

2018-10-17 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68586/#review209706
---


Ship it!




Ship It!

- Jiang Yan Xu


On Aug. 31, 2018, 9:47 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68586/
> ---
> 
> (Updated Aug. 31, 2018, 9:47 a.m.)
> 
> 
> Review request for mesos, Gilbert Song and Joseph Wu.
> 
> 
> Bugs: MESOS-9172
> https://issues.apache.org/jira/browse/MESOS-9172
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The `output_file` field of the `CommandInfo::URI` is not combined
> into the hash value. This means that the fetcher does not consider two
> messages that differ only by the `output_file` as different. By adding
> the `output_file` to the hash it is possible for a task to fetch the
> same URI to multiple, distinct ouptuts.
> 
> 
> Diffs
> -
> 
>   include/mesos/type_utils.hpp 19ea81716496bcc0117a1b0ff157a0374f38bbfa 
>   include/mesos/v1/mesos.hpp fda3eb42061f820869a2d8da939fccadc4e5ddfb 
> 
> 
> Diff: https://reviews.apache.org/r/68586/diff/1/
> 
> 
> Testing
> ---
> 
> make check (Fedora 28)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 68587: Fixed fetcher deadlock with duplicate URIs.

2018-10-17 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68587/#review209704
---


Fix it, then Ship it!





src/slave/containerizer/fetcher.cpp
Lines 409-411 (patched)
<https://reviews.apache.org/r/68587/#comment294254>

The crux here is that URIs with the same `value` but  are different in 
other fields like `executable` or `extract` can lead to multiple items in 
`entries` pointing to the same cache entry. Add it to the comment?



src/slave/containerizer/fetcher.cpp
Lines 421 (patched)
<https://reviews.apache.org/r/68587/#comment294252>

Using `newEntries.at()` is more idiomatic even though we did check that the 
entry exists.


- Jiang Yan Xu


On Aug. 31, 2018, 9:47 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68587/
> ---
> 
> (Updated Aug. 31, 2018, 9:47 a.m.)
> 
> 
> Review request for mesos, Gilbert Song and Joseph Wu.
> 
> 
> Bugs: MESOS-9172
> https://issues.apache.org/jira/browse/MESOS-9172
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If a fetch request contains duplicate URIs that are not already
> in the cache, the fetcher would erroneously expect that some other
> fetch process is going to download that cache entry. It will then
> wait for a Future that will never complete.
> 
> The fix is to track whether the cache entry was created in this
> fetch, and in that case to simply allow the duplicate URI. In
> the fetcher, we check the cache before downloading so that a URIs
> can be fetched to distinct output files without being downloaded
> multiple times.
> 
> 
> Diffs
> -
> 
>   src/launcher/fetcher.cpp ef8d7ebdcf6d5d15459fd3392cb99774f078347a 
>   src/slave/containerizer/fetcher.cpp 
> 17f5388200c8341936cb4d7f8da67b5f286b727d 
>   src/tests/fetcher_tests.cpp f3ea7092635c88b9dddcc2998a4c7350fb56110c 
> 
> 
> Diff: https://reviews.apache.org/r/68587/diff/1/
> 
> 
> Testing
> ---
> 
> make check (Fedora 28)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 68999: Fixed the XFS project ID labeling to not cross mount points.

2018-10-12 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68999/#review209500
---


Ship it!




+1 on the issue you raised yourself as well.


src/slave/containerizer/mesos/isolators/xfs/utils.cpp
Lines 363 (patched)
<https://reviews.apache.org/r/68999/#comment293968>

I think to summarize the linked article here one key point worth mentioning 
is the "mount crossings" check applies even when the source and target are on 
the same filesystem (which `FTS_XDEV` doesn't give us)?


- Jiang Yan Xu


On Oct. 11, 2018, 3:23 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68999/
> ---
> 
> (Updated Oct. 11, 2018, 3:23 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9300
> https://issues.apache.org/jira/browse/MESOS-9300
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If we bind mount a directory on the same filesystem, the fts(3)
> directory traversal that recursively assigns XFS project IDs isn't
> able to detect when it crosses the mount point. This leads it to
> erroneously rewrite the project IDs for persistent volumes if they
> are bind mounted into the container by the `filesystem/linux`
> isolator. The fix is to test whether we crossed a mount point
> whenever we descend into a new directory so that we can avoid
> assigning project ID beneath it.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/xfs/utils.cpp 
> d5d03774b39ff6fada4a56e565d8c754790c42db 
>   src/tests/containerizer/xfs_quota_tests.cpp 
> a615ab125cd0144c5da9445468be2aaa0dc05507 
> 
> 
> Diff: https://reviews.apache.org/r/68999/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check (Fedora 28)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 68706: Added master failover reregistration progress metrics.

2018-09-13 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68706/#review208599
---

src/master/master.cpp
Lines 7436 (patched)
<https://reviews.apache.org/r/68706/#comment292676>

There are a few considerations on using one timer for all agent 
reregistrations.

1. The full list of statistics only make sense when 100% of the agents are 
reregistered.

See this example in the testing section:

```
"master/slaves_reregistration_secs":32.321583104,
"master/slaves_reregistration_secs/count":7,
"master/slaves_reregistration_secs/max":32.321583104,
"master/slaves_reregistration_secs/min":3.35373696,
"master/slaves_reregistration_secs/p50":8.774915072,
"master/slaves_reregistration_secs/p90":30.8676036608,
"master/slaves_reregistration_secs/p95":31.59459338236,
"master/slaves_reregistration_secs/p99":32.176185159679996,
"master/slaves_reregistration_secs/p999":32.307043309567995,
"master/slaves_reregistration_secs/p":32.3201291245568,
```

What does /p50 mean when 200 of the 1000 total are reregistered? I think 
metrics always being meaningful is a requirement for continuous monitoring.

2. `TimeSeries` drops old values during 
[sparsification](https://github.com/apache/mesos/blob/91ca8e2b2071f7e4b89702ae7c807b074bdef31b/3rdparty/libprocess/include/process/timeseries.hpp#L191).
 Therefore if the total agent count in the cluster is bigger than the time 
series capacity (large Mesos deployments have O(1) agents, old values get 
dropped and the p25 is no longer "25% of the total agents" even after all 
agents are reregistered. This leads to the next point.

3. `Timer`'s semantics: it's suppose to work when each observation has 
equal semantics, e.g., each 
[state_store](http://mesos.apache.org/documentation/latest/monitoring/#registrar)
 has the same meaning, therefore it's OK to drop old values which may reduce 
precision but will not changing the meaning to the statistics.

Let's start with how the monitoring system will use it and work backwards. 
I think my proposal in https://issues.apache.org/jira/browse/MESOS-9178 is 
worth considering but maybe there are better ones.

- Jiang Yan Xu

On Sept. 12, 2018, 2:42 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68706/
> -----------
> 
> (Updated Sept. 12, 2018, 2:42 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9178
> https://issues.apache.org/jira/browse/MESOS-9178
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When an agent is reregistrated, the time delta from that moment to
> the master elected time was saved; In the progress of reregistration,
> each data entry represents the registration time delta from master
> elected time; The percentile of these data as in this metrics can
> represent overall reregistration progress; In case of degradation
> towards to the end of reregistration, the high percentile will
> reflect it.
> 
> Note: These metrics only represent the completed reregistration; It
> does not monitor agents were finally marked as unreachable that the
> reregistration didn't actually happen, the unreachable agents were
> already monitored by existing metrics.
> 
> 
> Diffs
> -
> 
>   docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 
>   src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 
>   src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 
>   src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 
> 
> 
> Diff: https://reviews.apache.org/r/68706/diff/1/
> 
> 
> Testing
> ---
> 
> Tested in mmaster with seven reregistration agents:
> "master/slaves_reregistration_secs":32.321583104,
> "master/slaves_reregistration_secs/count":7,
> "master/slaves_reregistration_secs/max":32.321583104,
> "master/slaves_reregistration_secs/min":3.35373696,
> "master/slaves_reregistration_secs/p50":8.774915072,
> "master/slaves_reregistration_secs/p90":30.8676036608,
> "master/slaves_reregistration_secs/p95":31.59459338236,
> "master/slaves_reregistration_secs/p99":32.176185159679996,
> "master/slaves_reregistration_secs/p999":32.307043309567995,
> "master/slaves_reregistration_secs/p":32.3201291245568,
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 68448: Skip the container if there's no container network.

2018-08-21 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/68448/#review207693
---


Ship it!




Ship It!

- Jiang Yan Xu


On Aug. 21, 2018, 12:16 p.m., Sergey Urbanovich wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/68448/
> ---
> 
> (Updated Aug. 21, 2018, 12:16 p.m.)
> 
> 
> Review request for mesos, Jie Yu and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-9171
> https://issues.apache.org/jira/browse/MESOS-9171
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If the container wants to join host network it won't have container
> networks. In that case `rootDir` could be not configured and
> `rootDir.get()` in `usage()` will cause mesos agent crash.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/network/cni/cni.cpp 
> 6df52a55b13675df593872559b5238e8a10fd666 
> 
> 
> Diff: https://reviews.apache.org/r/68448/diff/1/
> 
> 
> Testing
> ---
> 
> bin/mesos-tests.sh --verbose 
> --gtest_filter="CniIsolatorTest.VETH_VerifyResourceStatistics" 
> --gtest_break_on_failure --gtest_repeat=100
> 
> 
> Thanks,
> 
> Sergey Urbanovich
> 
>

Re: Review Request 67365: Added MESOS-8340 to the 1.7.x CHANGELOG.

2018-05-30 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67365/#review204104
---


Ship it!




Ship It!

- Jiang Yan Xu


On May 30, 2018, 11:57 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67365/
> ---
> 
> (Updated May 30, 2018, 11:57 a.m.)
> 
> 
> Review request for mesos, Xudong Ni and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8340
> https://issues.apache.org/jira/browse/MESOS-8340
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added MESOS-8340 to the 1.7.x CHANGELOG.
> 
> 
> Diffs
> -
> 
>   CHANGELOG 52c7fc3bcd73a0eead2934644f636b85c60e0061 
>   docs/upgrades.md 1a1705cd9c38583c591392dd154d136fce04129d 
> 
> 
> Diff: https://reviews.apache.org/r/67365/diff/2/
> 
> 
> Testing
> ---
> 
> Manual inspection of the dev website.
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 67365: Added MESOS-8340 to the 1.7.x CHANGELOG.

2018-05-30 Thread Jiang Yan Xu



> On May 30, 2018, 10:56 a.m., Jiang Yan Xu wrote:
> > docs/upgrades.md
> > Lines 445 (patched)
> > <https://reviews.apache.org/r/67365/diff/1/?file=2031690#file2031690line445>
> >
> > Is it TCP specific?
> 
> James Peach wrote:
> Yes.

I see and I found the word TCP in code comments. So apps are free to listening 
on unassigned UDP ports? This is fine but I am just a bit surprised this is the 
place in the documentation that TCP is mentioned (not in network-ports.md 
either). Worth clarifying (separately as this RR is about CHANGELOG)?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67365/#review204058
---


On May 30, 2018, 11:57 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67365/
> ---
> 
> (Updated May 30, 2018, 11:57 a.m.)
> 
> 
> Review request for mesos, Xudong Ni and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8340
> https://issues.apache.org/jira/browse/MESOS-8340
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added MESOS-8340 to the 1.7.x CHANGELOG.
> 
> 
> Diffs
> -
> 
>   CHANGELOG 52c7fc3bcd73a0eead2934644f636b85c60e0061 
>   docs/upgrades.md 1a1705cd9c38583c591392dd154d136fce04129d 
> 
> 
> Diff: https://reviews.apache.org/r/67365/diff/2/
> 
> 
> Testing
> ---
> 
> Manual inspection of the dev website.
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 67365: Added MESOS-8340 to the 1.7.x CHANGELOG.

2018-05-30 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67365/#review204058
---




docs/upgrades.md
Lines 445 (patched)
<https://reviews.apache.org/r/67365/#comment286451>

Is it TCP specific?


- Jiang Yan Xu


On May 29, 2018, 3:46 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/67365/
> ---
> 
> (Updated May 29, 2018, 3:46 p.m.)
> 
> 
> Review request for mesos, Xudong Ni and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8340
> https://issues.apache.org/jira/browse/MESOS-8340
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added MESOS-8340 to the 1.7.x CHANGELOG.
> 
> 
> Diffs
> -
> 
>   CHANGELOG 52c7fc3bcd73a0eead2934644f636b85c60e0061 
>   docs/upgrades.md 1a1705cd9c38583c591392dd154d136fce04129d 
> 
> 
> Diff: https://reviews.apache.org/r/67365/diff/1/
> 
> 
> Testing
> ---
> 
> Manual inspection of the dev website.
> 
> 
> Thanks,
> 
> James Peach
> 
>

Review Request 67115: Added more diagnostic info for a CHECK.

2018-05-14 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/67115/
---

Review request for mesos and James Peach.


Repository: mesos


Description
---

Added more diagnostic info for a CHECK.


Diffs
-

  src/master/allocator/mesos/hierarchical.cpp 
1000968be6a2935a4cac571414d7f06d7df7acf0 


Diff: https://reviews.apache.org/r/67115/diff/1/


Testing
---

make check


Thanks,

Jiang Yan Xu

Re: Review Request 66919: Failure to update registry should abort the master process.

2018-05-07 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66919/#review202606
---


Fix it, then Ship it!




FYI committed with some fixes on the commit message (which comes from the 
review summary).


src/master/http.cpp
Lines 4163 (patched)
<https://reviews.apache.org/r/66919/#comment284492>

You missed a space between `result)` and `{` which I didn't catch initially 
but fixed up in a subsequent commit.


- Jiang Yan Xu


On May 7, 2018, 11:10 a.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66919/
> ---
> 
> (Updated May 7, 2018, 11:10 a.m.)
> 
> 
> Review request for mesos, Joseph Wu and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8630
> https://issues.apache.org/jira/browse/MESOS-8630
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the registrar fails to update the registry it would abort the
> actor and fail all future operations,However when the registrar
> updates is requested by an operator API such as maintenance update,
> the master process doesn't shut down (a 500 error is returned to the
> client instead)and all subsequent operations will fail.
> 
> Review: https://reviews.apache.org/r/66919
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad 
> 
> 
> Diff: https://reviews.apache.org/r/66919/diff/3/
> 
> 
> Testing
> ---
> 
> xujyan made a local test change to verify this patch:
> https://github.com/xujyan/mesos/commit/68051320a87431f6d2f3fbad6b0b97814200a731
> 
> Test output:
> F0504 20:43:36.530680 1858991 http.cpp:4304] CHECK_READY(result): is FAILED: 
> Failed to update registry: Failed to perform store within 5secs
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 66919: Failure to update registry should abort the master process.

2018-05-04 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66919/#review202496
---



The "process aborting" logic is a bit hard to write tests for but we can test 
it manually without checking in the code.

e.g., I changed one test a bit to show the check failure: 
https://github.com/xujyan/mesos/commit/68051320a87431f6d2f3fbad6b0b97814200a731

Could you update the "Testing Done" section with the link? That should cover 
it. :)

- Jiang Yan Xu


On May 4, 2018, 12:46 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66919/
> ---
> 
> (Updated May 4, 2018, 12:46 p.m.)
> 
> 
> Review request for mesos, Joseph Wu and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8630
> https://issues.apache.org/jira/browse/MESOS-8630
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the registrar fails to update the registry it would abort the
> actor and fail all future operations,However when the registrar
> updates is requested by an operator API such as maintenance update,
> the master process doesn't shut down (a 500 error is returned to the
> client instead)and all subsequent operations will fail.
> 
> Review: https://reviews.apache.org/r/66919
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad 
> 
> 
> Diff: https://reviews.apache.org/r/66919/diff/1/
> 
> 
> Testing
> ---
> 
> No error found by running all unit tests(make check)
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 66919: Failure to update registry should abort the master process.

2018-05-04 Thread Jiang Yan Xu



> On May 4, 2018, 8:19 a.m., Jiang Yan Xu wrote:
> > src/master/http.cpp
> > Lines 4163-4165 (patched)
> > <https://reviews.apache.org/r/66919/diff/1/?file=2016039#file2016039line4163>
> >
> > The following will be more idiomatic.
> > 
> > ```
> >   .onAny([](const Future& result) {
> > CHECK_READY(result);
> >   })
> > ```
> > 
> > Note the spaces and const ref.
> > 
> > 
> > It'll be helpful to get familiarize with
> > 
> > https://github.com/apache/mesos/blob/master/docs/c%2B%2B-style-guide.md
> > 
> > which in turn references 
> > 
> > https://google.github.io/styleguide/cppguide.html
> 
> Jiang Yan Xu wrote:
> It's also helpful to attach a message to the check failure:
> 
> 
> ```
> CHECK_READY(result)
>   << "Failed to update maintenance schedule in the registry";
> ```

Also let's add a TODO above the `CHECK_READY`:

```
TODO(fiu): Consider changing/refactoring the registrar itself so the individual 
call sites don't need to handle this separately. All registrar failures that 
cause it to abort should instead abort the process.
```


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66919/#review202416
---


On May 4, 2018, 12:46 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66919/
> ---
> 
> (Updated May 4, 2018, 12:46 p.m.)
> 
> 
> Review request for mesos, Joseph Wu and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8630
> https://issues.apache.org/jira/browse/MESOS-8630
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the registrar fails to update the registry it would abort the
> actor and fail all future operations,However when the registrar
> updates is requested by an operator API such as maintenance update,
> the master process doesn't shut down (a 500 error is returned to the
> client instead)and all subsequent operations will fail.
> 
> Review: https://reviews.apache.org/r/66919
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad 
> 
> 
> Diff: https://reviews.apache.org/r/66919/diff/1/
> 
> 
> Testing
> ---
> 
> No error found by running all unit tests(make check)
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 66919: Failure to update registry should abort the master process.

2018-05-04 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66919/#review202416
---



LGTM

It's hard to write a test for this but we can probably modify a test in 
master_maintenance_tests.cpp to observe the check failure. 

Let's also check Joseph with the comment here.
https://github.com/apache/mesos/blob/520b729857223aeade345cbdf61209ec4f395ad9/src/master/maintenance.hpp#L39


src/master/http.cpp
Lines 4163-4165 (patched)
<https://reviews.apache.org/r/66919/#comment284233>

The following will be more idiomatic.

```
  .onAny([](const Future& result) {
CHECK_READY(result);
  })
```

Note the spaces and const ref.

It'll be helpful to get familiarize with

https://github.com/apache/mesos/blob/master/docs/c%2B%2B-style-guide.md

which in turn references 

https://google.github.io/styleguide/cppguide.html


- Jiang Yan Xu


On May 3, 2018, 9:43 a.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66919/
> ---
> 
> (Updated May 3, 2018, 9:43 a.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8630
> https://issues.apache.org/jira/browse/MESOS-8630
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When the registrar fails to update the registry it would abort the
> actor and fail all future operations,However when the registrar
> updates is requested by an operator API such as maintenance update,
> the master process doesn't shut down (a 500 error is returned to the
> client instead)and all subsequent operations will fail.
> 
> Review: https://reviews.apache.org/r/66919
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad 
> 
> 
> Diff: https://reviews.apache.org/r/66919/diff/1/
> 
> 
> Testing
> ---
> 
> No error found by running all unit tests(make check)
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>

Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.

2018-05-03 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/#review202368
---


Fix it, then Ship it!




I'll commit with these minor fixes.


src/master/master.hpp
Lines 1928 (patched)
<https://reviews.apache.org/r/66644/#comment284232>

I'll quote the `unreachable` as it is code.



src/master/master.cpp
Lines 7131 (patched)
<https://reviews.apache.org/r/66644/#comment284170>

Space in between `foreachkey (`



src/master/master.cpp
Lines 7135 (patched)
<https://reviews.apache.org/r/66644/#comment284171>

Space in between `foreach (`



src/tests/partition_tests.cpp
Lines 168-169 (patched)
<https://reviews.apache.org/r/66644/#comment284231>

This clearly can fit in the previous line which would be less jagged.



src/tests/partition_tests.cpp
Lines 396 (patched)
<https://reviews.apache.org/r/66644/#comment284230>

Period to end the sentence.


- Jiang Yan Xu


On May 1, 2018, 10:18 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66644/
> ---
> 
> (Updated May 1, 2018, 10:18 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: 8750
> https://issues.apache.org/jira/browse/8750
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A RunTask messsage could get dropped for an agent while its
> disconnected from the master and when such an agent goes unreachable
> then this dropped task message gets added to the unreachable tasks.
> When the agent reregisters, the master sends status updates for the
> tasks that the agent reported when re-registering and these tasks are
> also removed from the unreachableTasks on the framework but since the
> agent doesn't know about the dropped task so it doesn't get removed
> from the unreachableTasks leading to a check failure when
> this inconsistency is detected during framework removal.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
>   src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
>   src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 
> 
> 
> Diff: https://reviews.apache.org/r/66644/diff/5/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.

2018-05-01 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/#review202215
---



LGTM sans the unresolved issue from the previous revision!


src/tests/partition_tests.cpp
Lines 167-169 (original), 167-170 (patched)
<https://reviews.apache.org/r/66644/#comment283966>

This reads like "the `RunTaskMessage` is remove correctly" but I think the 
subject is the `task`?

Perhaps revise a bit so it says 

```
// This test verifies that if a `RunTaskMessage` dropped en route to
// an agent which later becomes unreachable, the task is removed correctly
// from the master's unreachable task records when the agent
// reregisters.
```

?



src/tests/partition_tests.cpp
Lines 171 (patched)
<https://reviews.apache.org/r/66644/#comment283967>

This seems to be exceeded the 80 character limit.



src/tests/partition_tests.cpp
Lines 47 (patched)
<https://reviews.apache.org/r/66644/#comment283968>

It doesn't look like this is used?


- Jiang Yan Xu


On April 25, 2018, 8:46 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66644/
> ---
> 
> (Updated April 25, 2018, 8:46 a.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: 8750
> https://issues.apache.org/jira/browse/8750
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A RunTask messsage could get dropped for an agent while it's
> disconnected from the master and when such an agent goes unreachable
> then this dropped task gets added to the unreachable tasks.
> When the agent reregisters, tasks reported by it are removed from the 
> unreachableTasks bookkeeping on the master but since the
> agent doesn't know about the dropped task so it doesn't get removed
> from the unreachableTasks leading to a master check failure when
> this inconsistency is detected during framework removal.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
>   src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
>   src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 
> 
> 
> Diff: https://reviews.apache.org/r/66644/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.

2018-05-01 Thread Jiang Yan Xu



> On April 25, 2018, 3:39 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 10577-10584 (patched)
> > <https://reviews.apache.org/r/66644/diff/3/?file=2012148#file2012148line10577>
> >
> > This is the case we don't need to use `at()` as we are modifying the 
> > hashmap, we are not concerned about adding empty entries. We can just use 
> > `[]`: 
> > http://www.cplusplus.com/reference/unordered_map/unordered_map/operator[]/
> > 
> > The following should work for all the cases here.
> > 
> > ```
> > slaves.unreachableTasks[slave->id].put(task->framework_id(), 
> > task->task_id());
> > ```
> 
> Megha Sharma wrote:
> I agree we can just use the `unordered_map::operator[]` without the empty 
> entry adding concern but `unordered_map::at` here has the same behavior. I 
> don't mind changing it but just curious here as to why we prefer the 
> `operator[]` in this case?

Because the 8 lines can be condensed into 1-2? It's not about `[]` vs. `at` per 
se but the whole condition checking. I'd say the `[]` behavior is precesely 
created for this case so we'd better use it. :)


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66644/#review201963
---


On April 25, 2018, 8:46 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/66644/
> -----------
> 
> (Updated April 25, 2018, 8:46 a.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: 8750
> https://issues.apache.org/jira/browse/8750
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> A RunTask messsage could get dropped for an agent while it's
> disconnected from the master and when such an agent goes unreachable
> then this dropped task gets added to the unreachable tasks.
> When the agent reregisters, tasks reported by it are removed from the 
> unreachableTasks bookkeeping on the master but since the
> agent doesn't know about the dropped task so it doesn't get removed
> from the unreachableTasks leading to a master check failure when
> this inconsistency is detected during framework removal.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 
>   src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a 
>   src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc 
> 
> 
> Diff: https://reviews.apache.org/r/66644/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Review Request 66778: Fixed Master which generated updates with the latest task state.

2018-04-24 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66778/
---

Review request for mesos, Andrei Budnik, Alexander Rukletsov, and Megha Sharma.


Bugs: MESOS-8824
https://issues.apache.org/jira/browse/MESOS-8824


Repository: mesos


Description
---

The master should use `Task.status_update_state` to generate status
updates during agent reregistration instead of `Task.state` because
using the latter could result in the scheduler seeing out-of-order task
state transitions (e.g., TASK_FINISHED -> TASK_RUNNING).


Diffs
-

  src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 
  src/tests/partition_tests.cpp 606025a54dc54a777222c4dca5fd8c3c084031d2 


Diff: https://reviews.apache.org/r/66778/diff/1/


Testing
---

Make check.


Thanks,

Jiang Yan Xu

Review Request 66769: Fixed flaky ReconciliationTest.ReconcileStatusUpdateTaskState.

2018-04-23 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/66769/
---

Review request for mesos, Andrei Budnik, Alexander Rukletsov, and Megha Sharma.


Bugs: MESOS-8618
https://issues.apache.org/jira/browse/MESOS-8618


Repository: mesos


Description
---

To simulate a master failover we need to use `replicated_log` as the
registry otherwiser the master loses persisted info about the agents.


Diffs
-

  src/tests/reconciliation_tests.cpp b06244b5f3efad3a88e367cd8e26cebd3d9f2e43 


Diff: https://reviews.apache.org/r/66769/diff/1/


Testing
---

Make check.


Thanks,

Jiang Yan Xu

Re: Review Request 65967: Eliminated some copying in the master's Accept call handler path.

2018-03-08 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65967/#review198911
---


Ship it!




Sweet! Would it make sense to group JIRAs into a "copy elimination" epic? 
Completing a sweep is a large scale effort so an Epic with subtasks helps 
tracking remaining work?

- Jiang Yan Xu


On March 7, 2018, 7:52 p.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65967/
> ---
> 
> (Updated March 7, 2018, 7:52 p.m.)
> 
> 
> Review request for mesos and Meng Zhu.
> 
> 
> Bugs: MESOS-8628
> https://issues.apache.org/jira/browse/MESOS-8628
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This eliminates two copies of the `Accept` message.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp acf10a0f396234ec9f444c9b41515d657379d6ad 
>   src/master/master.hpp c9c8a968b6f56fe261ac9fe374e926a28d40fccb 
>   src/master/master.cpp e666b664dd125317cda5d16285d444b9c21e1f35 
> 
> 
> Diff: https://reviews.apache.org/r/65967/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>

Review Request 65438: Fixed flaky SlaveTest.RegisteredAgentReregisterAfterFailover.

2018-01-31 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65438/
---

Review request for mesos, Alexander Rukletsov and James Peach.


Bugs: MESOS-8232
https://issues.apache.org/jira/browse/MESOS-8232


Repository: mesos


Description
---

Previously due to a race condition a registration retry could be held
in the authorizer and forwarded back to the master only after the
agent's disconnection is detected, which leads to an remove registrar
operation when the test asserts there's none.


Diffs
-

  src/tests/slave_tests.cpp 20b874481d3818574731fc30ba9df1fc2bcbe900 


Diff: https://reviews.apache.org/r/65438/diff/1/


Testing
---

Ran the test with 1000 iterations.


Thanks,

Jiang Yan Xu

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-12 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195350
---


Ship it!




Will commit with a few touches like the following.


src/master/master.cpp
Lines 10937-10941 (original), 10935-10939 (patched)
<https://reviews.apache.org/r/64940/#comment274514>

Given our latest conclusion let's revert this back to 

```
count += framework->unreachableTasks.size();
    ```


- Jiang Yan Xu


On Jan. 12, 2018, 11:24 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 12, 2018, 11:24 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp f915c6f611eaffd5c2cb6cba6c43def679e8f92d 
>   src/master/master.cpp c96cd7090875fbf1b11b1708390e88182c77655b 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/7/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-12 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195241
---



LGTM modulo Vinod's comments. Sorry we have gone back and forth on this but 
this time I think we are close to reach what's shippable to unblock 1.5.


src/master/http.cpp
Lines 360-362 (original)
<https://reviews.apache.org/r/64940/#comment274470>

So, as reminded by Vinod, for backward compatibility let's keep the checks. 
However here we should change the condition to

```
if (task->state() == TASK_UNREACHABLE) {
  continue;
}
```

because other terminal tasks that are not TASK_LOST could also be in this 
map...



src/master/master.hpp
Lines 2326-2327 (original)
<https://reviews.apache.org/r/64940/#comment274417>

The following CHECK looks correct?

```
CHECK(Master::isRemovable(task->state()) << task->state();
```

Although failing CHECKs are annoying, it helped us uncover the 
inconsistency in our task state transitions so it's better to correct it?



src/master/master.cpp
Lines 11023-11036 (original), 11021-11034 (patched)
<https://reviews.apache.org/r/64940/#comment274415>

I was going to suggest that we remove the `if (task->state() == 
TASK_UNREACHABLE)` check but per Vinod's comment we shouldn't.



src/tests/mesos.hpp
Lines 3511 (patched)
<https://reviews.apache.org/r/64940/#comment274471>

s/a/an/?



src/tests/mesos.hpp
Line 3511 (original), 3518 (patched)
<https://reviews.apache.org/r/64940/#comment274472>

s/a/an/?



src/tests/partition_tests.cpp
Lines 2401 (patched)
<https://reviews.apache.org/r/64940/#comment274477>

We have to s/unreachable/completed/ and s/but/and/ ?



src/tests/partition_tests.cpp
Lines 2402 (patched)
<https://reviews.apache.org/r/64940/#comment274478>

With all the discussions about teminology it seems  s/Completed/Terminal/ 
is more accurate?



src/tests/partition_tests.cpp
Lines 2440-2442 (patched)
<https://reviews.apache.org/r/64940/#comment274473>

Nit: declare them in the order we are expecting on them? i.e., starting -> 
running -> finished.



src/tests/partition_tests.cpp
Lines 2474-2475 (patched)
<https://reviews.apache.org/r/64940/#comment274474>

Without a paused clock, there is a chance of race here right? If the same 
status update is resent after a timeout before we acknowledge, it's possible 
that the resent update is going to be caught by the 2nd expectation?

Would it be possible to hoist the Clock::pause() statement?



src/tests/partition_tests.cpp
Lines 2527 (patched)
<https://reviews.apache.org/r/64940/#comment274475>

Remove this debugging log.



src/tests/partition_tests.cpp
Lines 2532-2534 (patched)
<https://reviews.apache.org/r/64940/#comment274476>

    Sorry now you have to expect it to be in the completed_task section...


- Jiang Yan Xu


On Jan. 5, 2018, 11:37 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 11:37 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/6/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-11 Thread Jiang Yan Xu



> On Jan. 11, 2018, 6:24 p.m., Vinod Kone wrote:
> > AFAICT, in 1.4.x we show unreachable terminal tasks on a removed agent as 
> > completed tasks. But now, we show them as unreachable tasks. If that's true 
> > it's an API change that needs to be called out. Is that really backwards 
> > compatible?

Yeah it's true. Despite it being a bug that if the unreachable terminal tasks 
are considered completed and added to the completed list, it cannot be later 
removed when the agent reregisters and duplicates are shown in the webUI and 
APIs, it is indeed what 1.4.x gives you.

1.5 fixes the duplication problem but we did the extra work (the additional `if 
(task->state() != TASK_UNREACHABLE)` checks we added and this revision removes) 
to make it look like the 1.4.x version, I guess it's fine to keep it that way 
until we have a plan for an overhaul...

So, probably let's not revert the code that involves the http endpoints (sorry 
for suggesting it earlier, there are small changes needed which I'll comment 
on).


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review195282
---


On Jan. 5, 2018, 11:37 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 11:37 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might have
> been lost. If a task is already terminal but has unacknowleged status
> updates, it is expected that we track it in the unreachable tasks list
> so we should remove the CHECK that prevents this. This also backs out
> changes to how unreachable tasks are presented in the HTTP endpoints to
> restore compatibility with previous Mesos releases.
> 
> 
> Diffs
> -
> 
>   src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/6/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 64516: Improved documentation on resource reservations.

2018-01-10 Thread Jiang Yan Xu



> On Jan. 9, 2018, 8:27 a.m., James Peach wrote:
> > docs/reservation.md
> > Lines 88 (patched)
> > <https://reviews.apache.org/r/64516/diff/1/?file=1912814#file1912814line98>
> >
> > "Similarly, agents can register ..."

Meaning s/by default// because of repetition? It doesn't feel too redundant to 
me because I wanted to emphasize the default behavior?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64516/#review195041
-----------


On Jan. 10, 2018, 2:23 p.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64516/
> ---
> 
> (Updated Jan. 10, 2018, 2:23 p.m.)
> 
> 
> Review request for mesos, Alexander Rojas, James Peach, and Michael Park.
> 
> 
> Bugs: MESOS-8306
> https://issues.apache.org/jira/browse/MESOS-8306
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added explanation for authorizations on static reservations.
> 
> 
> Diffs
> -
> 
>   docs/reservation.md fa1f309f2b770e37964af2d5599519875195dbec 
> 
> 
> Diff: https://reviews.apache.org/r/64516/diff/2/
> 
> 
> Testing
> ---
> 
> Used a markdown viewer.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 64516: Improved documentation on resource reservations.

2018-01-10 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64516/
---

(Updated Jan. 10, 2018, 2:23 p.m.)


Review request for mesos, Alexander Rojas, James Peach, and Michael Park.


Changes
---

Comments.


Bugs: MESOS-8306
https://issues.apache.org/jira/browse/MESOS-8306


Repository: mesos


Description
---

Added explanation for authorizations on static reservations.


Diffs (updated)
-

  docs/reservation.md fa1f309f2b770e37964af2d5599519875195dbec 


Diff: https://reviews.apache.org/r/64516/diff/2/

Changes: https://reviews.apache.org/r/64516/diff/1-2/


Testing
---

Used a markdown viewer.


Thanks,

Jiang Yan Xu

Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.

2018-01-10 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64515/
---

(Updated Jan. 10, 2018, 2:22 p.m.)


Review request for mesos, Alexander Rojas and James Peach.


Changes
---

Review comments.


Bugs: MESOS-8306
https://issues.apache.org/jira/browse/MESOS-8306


Repository: mesos


Description
---

Used `reserve_resources` ACL for static reservations.


Diffs (updated)
-

  include/mesos/authorizer/acls.proto d869820491f1b81f81e1157493aa73e32735b9c3 
  src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
  src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
  src/tests/master_authorization_tests.cpp 
676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 


Diff: https://reviews.apache.org/r/64515/diff/4/

Changes: https://reviews.apache.org/r/64515/diff/3-4/


Testing
---

make check.


Thanks,

Jiang Yan Xu

Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.

2018-01-10 Thread Jiang Yan Xu



> On Jan. 9, 2018, 8:45 a.m., James Peach wrote:
> > What do you think about testing the re-registration case? There's no 
> > coverage here, but equally, it would be annoying to duplicate all these 
> > tests for re-registration ...

Two other tests `AuthorizedToRegisterAndReregisterAgent` and 
`UnauthorizedToReregisterAgent` have covered the reregister case (both success 
and failure), even though the condition in these test differ slightly in that 
they don't include roles, I feel that they share the common logic around the 
handling the authorization success and failure of reregistering agents so they 
are adequate?


> On Jan. 9, 2018, 8:45 a.m., James Peach wrote:
> > src/tests/master_authorization_tests.cpp
> > Lines 2516 (patched)
> > <https://reviews.apache.org/r/64515/diff/3/?file=1929962#file1929962line2516>
> >
> > What happens with the "*" role? Is that covered by 
> > "AuthorizedToRegisterNoStaticReservations"?

Yes.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64515/#review195042
-----------


On Jan. 10, 2018, 2:22 p.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64515/
> ---
> 
> (Updated Jan. 10, 2018, 2:22 p.m.)
> 
> 
> Review request for mesos, Alexander Rojas and James Peach.
> 
> 
> Bugs: MESOS-8306
> https://issues.apache.org/jira/browse/MESOS-8306
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Used `reserve_resources` ACL for static reservations.
> 
> 
> Diffs
> -
> 
>   include/mesos/authorizer/acls.proto 
> d869820491f1b81f81e1157493aa73e32735b9c3 
>   src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 
>   src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 
>   src/tests/master_authorization_tests.cpp 
> 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 
> 
> 
> Diff: https://reviews.apache.org/r/64515/diff/4/
> 
> 
> Testing
> ---
> 
> make check.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-09 Thread Jiang Yan Xu



> On Jan. 4, 2018, 5:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10039>
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> > 
> > 
> > Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> > 
> > But doing this is tricky for various reasons:
> > 
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> > 
> > --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> > 
> > --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> > 
> > I think the direction we want to go towards is
> > 
> > --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> > 
> > 
> > Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
> Ignore the first line. Forgot to delete it.
> 
> Jiang Yan Xu wrote:
> Future direction
> 
> 1. If completed == terminal unacknowledged + terminal acknowledged, then 
> completed == terminal right? Should we then unify the terminology and pick 
> one?
> 2. Unreachable tasks == non-terminal tasks on an unreachable agent: this 
> is what this RR is going to do but IIUC you want a different behavior.
> 
> Current semantics
> 
> 1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
> operator API subsribers for terminal tasks), as it stands right now we are 
> going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
> that?
> 2. You also said above that "Ideally, we want terminal but unacknowledged 
> tasks to still be marked unreachable in some way" which seems to contradict 
> your later point that "Unreachable tasks should consist of all non-complete 
> (terminal) tasks", could you clarify?
> 
> Overall it sounds to me that the most correct semantic is still to set 
> `TASK_UNREACHABLE` only for the tasks that are non-terminal (because 
> otherwise we know that the state is not going to change to something else 
> that we don't know yet) but perhaps we can use another field in the status to 
> signal the fact that the agent is partitioned?
> 
> James Peach wrote:
> https://issues.apache.org/jira/browse/MESOS-8405
> 
> Vinod Kone wrote:
> Sorry for the confusion. In the future direct

Re: Review Request 64968: Removed the inaccurate `slaves.transitioning` function in the master.

2018-01-09 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64968/#review195094
---


Ship it!





src/master/master.cpp
Lines 8507 (patched)
<https://reviews.apache.org/r/64968/#comment274241>

Perhaps it'll be more informative to log the `slaveId` if it is present? 
i.e., distinguish the two sub-cases in logging?


- Jiang Yan Xu


On Jan. 4, 2018, 5:38 p.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64968/
> ---
> 
> (Updated Jan. 4, 2018, 5:38 p.m.)
> 
> 
> Review request for mesos, Vinod Kone and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8390
> https://issues.apache.org/jira/browse/MESOS-8390
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This function used to return whether the agent was transitioning
> between states, however it was updated in e510813f93e253480005ce
> to return only when the agent is recovered.
> 
> This patch removes the now confusing function in favor of just
> directly checking the state we care about in the reconciliation
> logic. Since there were no other usages, the function is removed.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
> 
> 
> Diff: https://reviews.apache.org/r/64968/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>

Re: Review Request 65050: Added some defensive checks for persistent volume sharing check.

2018-01-09 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65050/#review195075
---


Ship it!





src/slave/containerizer/mesos/isolators/filesystem/linux.cpp
Lines 604-605 (original), 611-612 (patched)
<https://reviews.apache.org/r/65050/#comment274214>

The code looks correct but the review summary seems not accurate: we are 
more benefitting from the fact that in these isolators we set `info->resources` 
only after we update the volumes on disk. Docker containerizer has the same `if 
(current.contains(resource))` check but that check doesn't work because we set 
`container->resources` first there.


- Jiang Yan Xu


On Jan. 9, 2018, 12:44 p.m., Jie Yu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65050/
> ---
> 
> (Updated Jan. 9, 2018, 12:44 p.m.)
> 
> 
> Review request for mesos, Gilbert Song, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8356
> https://issues.apache.org/jira/browse/MESOS-8356
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch adds some defensive checks for detecting if a persistent
> volume is shared by multiple containers. It's currently not a bug
> because of a `contains` check above in the loop body. But this is a bit
> fragile and should be avoided.
> 
> See more details in MESOS-8356.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/mesos/isolators/filesystem/linux.cpp 
> 5200182970fa978e940d3f0e9cddce37608e84f7 
>   src/slave/containerizer/mesos/isolators/filesystem/posix.cpp 
> 6b9f4bb42621eec3eb648712fbfa41d94f253895 
> 
> 
> Diff: https://reviews.apache.org/r/65050/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>

Re: Review Request 65049: Fixed the persistent volume permission issue in DockerContainerizer.

2018-01-09 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/65049/#review195068
---


Ship it!




Ship It!

- Jiang Yan Xu


On Jan. 9, 2018, 11:31 a.m., Jie Yu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/65049/
> ---
> 
> (Updated Jan. 9, 2018, 11:31 a.m.)
> 
> 
> Review request for mesos, Gilbert Song, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8356
> https://issues.apache.org/jira/browse/MESOS-8356
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This patch fixes MESOS-8356 by skipping the current container to be
> launched when doing the shared volume check (`isVolumeInUse`). Prior to
> this patch, the code is buggy because `isVolumeInUse` will always be set
> to `true`.
> 
> 
> Diffs
> -
> 
>   src/slave/containerizer/docker.cpp 890cb2aba44fe76e891117833eac8ccca00b759b 
> 
> 
> Diff: https://reviews.apache.org/r/65049/diff/1/
> 
> 
> Testing
> ---
> 
> sudo make check
> 
> 
> Thanks,
> 
> Jie Yu
> 
>

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-05 Thread Jiang Yan Xu



> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 10061 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10062>
> >
> > Perhaps move all the logic around determining the unreachable task down 
> > here.
> > 
> > ```
> > // We need to inform `removeTask` whether the task became unreachable 
> > due to the agent being removed. Note that terminal tasks are not considered 
> > unreachable. We can not rely on the task's state because it may have been 
> > set to `TASK_LOST` for backwards-compatibility.
> > bool unreachable =
> >   unreachableTime.isSome() && !protobuf::isTerminalState(task->state());
> > ```
> > 
> > Inlining `unreachable` is probably fine too but separating it helps 
> > clarify it with comments?
> 
> James Peach wrote:
> You have to sample the task state before the update because it might 
> transition to a terminal state at that time (and LOST is terminal).

Right, so put the line above `updateTask`? The main thing I am commenting on is 
that we can consolidate the logic without first defining `bool unreachable = 
true;` much far above?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7637 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7637>
> >
> > You could just wait for the the status update (being droppped) instead 
> > of capaturing `Slave::executorTerminated` and risk the race condition of 
> > the master not receiving the status update before the agent is deemed 
> > partitioned?
> 
> James Peach wrote:
> Dealing with all the status updates was less reliable.

Do you see the race condition I was trying to point out?


> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote:
> > src/tests/master_tests.cpp
> > Lines 7686 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7686>
> >
> > No need to settle?
> 
> James Peach wrote:
> This is settling to ensure that events following from marking the agent 
> down are fully processed.

Oh ok that's right.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
---


On Jan. 5, 2018, 11:06 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> ---
> 
> (Updated Jan. 5, 2018, 11:06 a.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod 
> Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost, we try to remove all the tasks that might
> have been lost. However, if a task is already terminal, it hasn't
> really been lost so we should not be tracking it in the framework's
> unreachable tasks list.
> 
> 
> Diffs
> -
> 
>   src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 
>   src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 
>   src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a 
> 
> 
> Diff: https://reviews.apache.org/r/64940/diff/4/
> 
> 
> Testing
> ---
> 
> make check (Fedora 27)
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-05 Thread Jiang Yan Xu



> On Jan. 4, 2018, 5:25 p.m., Vinod Kone wrote:
> > src/master/master.cpp
> > Lines 10037-10056 (original), 10039-10062 (patched)
> > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10039>
> >
> > I think we shouldn't create a TASK_UNREACHABLE status update and call 
> > `updateTask` or `forward` for a terminal task at all. . Also, `forward` 
> > sends TASK_UNREACHABLE update for terminal task to the framework which 
> > looks incorrect.
> > 
> > 
> > Ideally, we want terminal but unacknowledged tasks to still be marked 
> > unreachable in some way, either via task state being TASK_UNREACHABLE or 
> > task being present in `unreachableTasks`. This allows, for example, the 
> > WebUI to not show sandbox links for unreachable tasks irrespective of 
> > whether they were terminal or not before going unreachable. 
> > 
> > But doing this is tricky for various reasons:
> > 
> > --> `updateTask()` doesn't allow a terminal state to be transitioned to 
> > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, 
> > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to 
> > operator API stream subscribers which looks incorrect. The fact that 
> > `updateTask` internally deals with already terminal tasks is a bad design 
> > decision in retrospect. I think the callers shouldn't call it for terminal 
> > tasks instead.
> > 
> > --> It's not clear to our users what a `completed` task means. The 
> > intention was for this to hold a cache of terminal and acknowledged tasks 
> > for storing recent history. The users of the WebUI probably equate 
> > "Completed Tasks" to terminal tasks irrespective of their acknowledgement 
> > status, which is why it is confusing for them to see terminal but 
> > unacknowledged tasks in the "Active tasks" section in the WebUI.
> > 
> > --> When a framework reconciles the state of a task on an unreachable 
> > agent, master replies with TASK_UNREACHABLE irrespective of whether the 
> > task was in a non-terminal state or terminal but un-acknowledged state or 
> > terminal and acknowledged state when the agent went unreachable.  
> > 
> > I think the direction we want to go towards is
> > 
> > --> Completed tasks should consist of terminal unacknowledged and 
> > terminal acknowled tasks, likely in two different data structures.
> > --> Unreachable tasks should consist of all non-complete tasks on an 
> > unreachable agent.  All the tasks in this map should be in TASK_UNREACHABLE 
> > state.
> > 
> > 
> > Given all the above is a very involved change, I would recommend 
> > keeping what you have here but with a giant TODO (your current comment in 
> > #10058 doesn't go into enough detail about the complexity here) that talks 
> > about the above stuff. Your change at least keeps the parity with the 
> > (broken) semantics that we have in 1.4 and earlier so that's a bit better.
> 
> Vinod Kone wrote:
> Ignore the first line. Forgot to delete it.

Future direction

1. If completed == terminal unacknowledged + terminal acknowledged, then 
completed == terminal right? Should we then unify the terminology and pick one?
2. Unreachable tasks == non-terminal tasks on an unreachable agent: this is 
what this RR is going to do but IIUC you want a different behavior.

Current semantics

1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the 
operator API subsribers for terminal tasks), as it stands right now we are 
going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change 
that?
2. You also said above that "Ideally, we want terminal but unacknowledged tasks 
to still be marked unreachable in some way" which seems to contradict your 
later point that "Unreachable tasks should consist of all non-complete 
(terminal) tasks", could you clarify?

Overall it sounds to me that the most correct semantic is still to set 
`TASK_UNREACHABLE` only for the tasks that are non-terminal (because otherwise 
we know that the state is not going to change to something else that we don't 
know yet) but perhaps we can use another field in the status to signal the fact 
that the agent is partitioned?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194794
---


On Jan. 3, 2018, 5:35 p.m., James Peach wrote:
> 
> -

Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.

2018-01-04 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64940/#review194777
---

src/master/master.cpp
Lines 10058-10060 (patched)
<https://reviews.apache.org/r/64940/#comment273769>

Perhaps we just need to comment on the logic we use to determine if the 
task is `unreachable` and not what `removeTask` is going to do about it 
(because it's hidden here)? i.e., remove the comment here and use something 
like what I suggested below?

src/master/master.cpp
Lines 10061 (patched)
<https://reviews.apache.org/r/64940/#comment273760>

Perhaps move all the logic around determining the unreachable task down 
here.

```
// We need to inform `removeTask` whether the task became unreachable due 
to the agent being removed. Note that terminal tasks are not considered 
unreachable. We can not rely on the task's state because it may have been set 
to `TASK_LOST` for backwards-compatibility.
bool unreachable =
  unreachableTime.isSome() && !protobuf::isTerminalState(task->state());
```

Inlining `unreachable` is probably fine too but separating it helps clarify 
it with comments?

src/master/master.cpp
Line 10273 (original), 10279-10280 (patched)
<https://reviews.apache.org/r/64940/#comment273766>

So we had to use a boolean because we cannot rely on the task *state*, so 
"any of the unreachable states" is a bit misleading?

I think the CHECK message is already clear enough for now? (Like you did in 
`removeTask`)

I suggest we don't comment further on the removable-ness of the task 
because we might want to change it: 
https://issues.apache.org/jira/browse/MESOS-8389

src/tests/master_tests.cpp
Lines 7601-7602 (patched)
<https://reviews.apache.org/r/64940/#comment273805>

This reads like "terminal tasks" are stored as "either finished or lost" 
when it's only the non-terminal task that is transitioned to lost.

Can we just say that we are verifying "the terminal task is stored in its 
original state when parition happens"?

src/tests/master_tests.cpp
Lines 7603 (patched)
<https://reviews.apache.org/r/64940/#comment273804>

Most agent partition related tests are put in partition_tests.cpp, can we 
put this one there as well for better grouping? 

In particular there is a test named `TaskCompletedOnPartitionedAgent` there 
which is somewhat related. We can name this one 
`AgentWithCompletedTaskPartitioned` or `TaskCompletedBeforeAgentPartitioned`

The main word I want to change is `Lost`. While I can understand you are 
using it interchangably with *partitioned*, in this context when it's call 
partitioned elsewhere and there's `TaskLost` to confuse with it, perhaps it's 
better to use the more precise word.

src/tests/master_tests.cpp
Lines 7621 (patched)
<https://reviews.apache.org/r/64940/#comment273811>

We typically name this `offers` since it's a vector. You'll be using the 
first and only one so later you should

```
  AWAIT_READY(offers);
  ASSERT_FALSE(offers->empty());
  const Offer& offer = offers->at(0);
```

src/tests/master_tests.cpp
Lines 7637 (patched)
<https://reviews.apache.org/r/64940/#comment273812>

You could just wait for the the status update (being droppped) instead of 
capaturing `Slave::executorTerminated` and risk the race condition of the 
master not receiving the status update before the agent is deemed partitioned?

src/tests/master_tests.cpp
Lines 7641 (patched)
<https://reviews.apache.org/r/64940/#comment273810>

Instead of settling it's more direct to just wait for the offer before 
using it (like I mentioned above).

src/tests/master_tests.cpp
Lines 7675 (patched)
<https://reviews.apache.org/r/64940/#comment273783>

"fast fast" to emphasize? :)

src/tests/master_tests.cpp
Lines 7681 (patched)
<https://reviews.apache.org/r/64940/#comment273784>

If you wait for the ping, you don't need to settle?

src/tests/master_tests.cpp
Lines 7686 (patched)
<https://reviews.apache.org/r/64940/#comment273785>

No need to settle?

- Jiang Yan Xu

On Jan. 3, 2018, 5:35 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64940/
> -----------
> 
> (Updated Jan. 3, 2018, 5:35 p.m.)
> 
> 
> Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8337
> https://issues.apache.org/jira/browse/MESOS-8337
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> If an agent is lost

Re: Review Request 64939: Abort libprocess when a Process throws an uncaught exception.

2018-01-04 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64939/#review194774
---


Ship it!




Ship It!

- Jiang Yan Xu


On Jan. 3, 2018, 5:30 p.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64939/
> ---
> 
> (Updated Jan. 3, 2018, 5:30 p.m.)
> 
> 
> Review request for mesos, Alexander Rukletsov, Benjamin Hindman, Ilya Pronin, 
> Jie Yu, and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, we logged the exception, terminated the throwing
> Process, and continued to run. However, currently there exists
> no known user-level code that I'm aware of that handles the
> unexpected termination due to an uncaught exception.
> 
> Generally, this means that when an exception is thrown (e.g.
> a bad call to std::map::at), the process terminates with a log
> message but things get "stuck" and the user has to debug what
> is wrong / kill the process.
> 
> Libprocess would likely need to provide some primitives to
> better support handling unexpected termination of a Process
> in order for us to provide a strategy where we continue running.
> 
> 
> Diffs
> -
> 
>   3rdparty/libprocess/src/process.cpp 
> 75cf1d3b6d3d257ba9bc81c68017a74a6511cebf 
> 
> 
> Diff: https://reviews.apache.org/r/64939/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>

Review Request 64933: Fixed flaky PartitionedSlaveReregistrationMasterFailover test.

2018-01-03 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64933/
---

Review request for mesos, Alexander Rukletsov and Megha Sharma.


Bugs: MESOS-8334
https://issues.apache.org/jira/browse/MESOS-8334


Repository: mesos


Description
---

The test was flaky due to a race condition that causes status updates
to be dropped because the agent reregisters before the schedulers.


Diffs
-

  src/tests/partition_tests.cpp 7f4b9ed938fd478fdf750b464c531bf76de7d798 


Diff: https://reviews.apache.org/r/64933/diff/1/


Testing
---

make check. Ran the affected test for 1000 iterations.


Thanks,

Jiang Yan Xu

Re: Review Request 64898: Removed duplicated code that tests for removable tasks.

2018-01-03 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64898/#review194706
---




src/master/master.cpp
Lines 10171 (patched)
<https://reviews.apache.org/r/64898/#comment273641>

Would it be simpler to just use the new variable without a lambda?

```
TaskState newState = latestState.getOrElse(status.state());
```


- Jiang Yan Xu


On Jan. 2, 2018, 4:26 p.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64898/
> ---
> 
> (Updated Jan. 2, 2018, 4:26 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When we get a task update and check whether the latest state
> indicated that the task is removable, invoke a local lambda
> function so that we don't duplicate the same code for both
> the current and latest states.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 03eb178fa1af7d55ae387e6cb42cdc8d721a2196 
> 
> 
> Diff: https://reviews.apache.org/r/64898/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> James Peach
> 
>

Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.

2018-01-03 Thread Jiang Yan Xu

> On Dec. 13, 2017, 11:32 a.m., James Peach wrote:
> > src/master/master.cpp
> > Line 3876 (original), 3881 (patched)
> > <https://reviews.apache.org/r/64515/diff/1/?file=1912811#file1912811line3881>
> >
> > You can now log the `SlaveID` here.

The `SlaveInfo` here may not have a `SlaveID` if it's called from 
`registerSlave`. Here to consistently identity the agent, we would need to pass 
in the pid of the agent, (`SlaveInfo` also has the hostname). This info is 
already logged once the authorization is done. 

Other `Master::authorize*` methods all log the data passed to the authorizer 
here, I'll add the agent's resources for consistency.

> On Dec. 13, 2017, 11:32 a.m., James Peach wrote:
> > src/master/master.cpp
> > Lines 3905 (patched)
> > <https://reviews.apache.org/r/64515/diff/1/?file=1912811#file1912811line3905>
> >
> > So what are the semantics for dynamic resources? If this new ACL 
> > semantics is expressing that resources in role `Foo` may only be provided 
> > on agents which principal `Bar`, it doesn't seem correct that we would 
> > allow dynamic reservations to violate that.

The semantics of `reserve_resources` ACL is still the same: it determines 
whether a principal can reserve resources of certain roles, regardless of 
whether it's static or dynamic.

What's new is the usage that considers the static reservation to be initiated 
by the agent's principal and we are authorizing it here.

I think in reality we use ACLs to prevent "bad actors" to do harmful things so 
in the case of dynamic reservations we don't need to addtinoally check against 
the agent's principal because the agent is not initiating the reservation. In 
any case, it shouldn't be authorized at agent registration time but reservation 
request time.

I adjusted the comments. Do you think it's clearer?

- Jiang Yan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64515/#review193690
---

On Jan. 3, 2018, 11 a.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64515/
> ---
> 
> (Updated Jan. 3, 2018, 11 a.m.)
> 
> 
> Review request for mesos, Alexander Rojas and James Peach.
> 
> 
> Bugs: MESOS-8306
> https://issues.apache.org/jira/browse/MESOS-8306
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Used `reserve_resources` ACL for static reservations.
> 
> 
> Diffs
> -
> 
>   include/mesos/authorizer/acls.proto 
> d869820491f1b81f81e1157493aa73e32735b9c3 
>   src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 
>   src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd 
>   src/tests/master_authorization_tests.cpp 
> 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 
> 
> 
> Diff: https://reviews.apache.org/r/64515/diff/3/
> 
> 
> Testing
> ---
> 
> make check.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.

2018-01-03 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64515/
---

(Updated Jan. 3, 2018, 11 a.m.)


Review request for mesos, Alexander Rojas and James Peach.


Changes
---

Addressed comments.


Bugs: MESOS-8306
https://issues.apache.org/jira/browse/MESOS-8306


Repository: mesos


Description
---

Used `reserve_resources` ACL for static reservations.


Diffs (updated)
-

  include/mesos/authorizer/acls.proto d869820491f1b81f81e1157493aa73e32735b9c3 
  src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 
  src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd 
  src/tests/master_authorization_tests.cpp 
676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 


Diff: https://reviews.apache.org/r/64515/diff/3/

Changes: https://reviews.apache.org/r/64515/diff/2-3/


Testing
---

make check.


Thanks,

Jiang Yan Xu

Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.

2018-01-03 Thread Jiang Yan Xu



> On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote:
> > src/master/master.cpp
> > Lines 6138-6141 (original), 6169-6173 (patched)
> > <https://reviews.apache.org/r/64515/diff/1/?file=1912811#file1912811line6169>
> >
> > How about:
> > 
> > _Not authorized to register agent with(out) principal XXX providing the 
> > resrouces _

It now reads

"Not authorized to register agent providing resources  with principal " 
or 
"Not authorized to register agent providing resources  with without a 
principal"

I am putting "agent" together with "providing resources" because they together 
are the objects.


> On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote:
> > src/tests/master_authorization_tests.cpp
> > Lines 2538-2539 (patched)
> > <https://reviews.apache.org/r/64515/diff/1/?file=1912812#file1912812line2538>
> >
> > Instead of the things in the parenthesis you can add:
> > 
> > ```c++
> > {
> >   mesos::ACL::ReserveResources* acl = acls.add_reserve_resources();
> >   acl->mutable_principals()->add_values(DEFAULT_CREDENTIAL.principal());
> >   acl->mutable_roles()->set_type(ACL::Entity::NONE);
> > }
> > ```
> > 
> > And add a comment when starting the agent indicating that the agent is 
> > registering with `DEFAULT_CREDENTIAL.principal()`

The test above `UnauthorizedToStaticallyReserveResources` has exactly this. I 
am just using different scenarios to broaden the test coverage.


> On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote:
> > src/tests/master_authorization_tests.cpp
> > Lines 2571 (patched)
> > <https://reviews.apache.org/r/64515/diff/1/?file=1912812#file1912812line2571>
> >
> > Please add a test where the agent tries to register with 
> > `high-security-role` and succeeds.

Added test `AuthorizedToStaticallyReserveRole`.


> On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote:
> > src/tests/master_authorization_tests.cpp
> > Lines 2577 (patched)
> > <https://reviews.apache.org/r/64515/diff/1/?file=1912812#file1912812line2577>
> >
> > this test can be merged with 
> > `MasterAuthorizationTest.UnauthorizedToStaticallyReserveResources` and you 
> > save a restart of a master.

You mean because the ACLs are set up the same way, we can have one test with 
two scenarios, agent1 unable to register for reason1, agent2 able to register 
for reason2? I think that's OK but in this case it's easier to understand the 
two cases individually with smaller and independent tests?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64515/#review193670
---


On Jan. 3, 2018, 11 a.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64515/
> ---
> 
> (Updated Jan. 3, 2018, 11 a.m.)
> 
> 
> Review request for mesos, Alexander Rojas and James Peach.
> 
> 
> Bugs: MESOS-8306
> https://issues.apache.org/jira/browse/MESOS-8306
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Used `reserve_resources` ACL for static reservations.
> 
> 
> Diffs
> -
> 
>   include/mesos/authorizer/acls.proto 
> d869820491f1b81f81e1157493aa73e32735b9c3 
>   src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 
>   src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd 
>   src/tests/master_authorization_tests.cpp 
> 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 
> 
> 
> Diff: https://reviews.apache.org/r/64515/diff/3/
> 
> 
> Testing
> ---
> 
> make check.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 64514: Refactor out `authorizeReserveResources` that takes a `Resources`.

2017-12-13 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64514/
---

(Updated Dec. 13, 2017, 11:53 a.m.)


Review request for mesos, Alexander Rojas and James Peach.


Bugs: MESOS-8306
https://issues.apache.org/jira/browse/MESOS-8306


Repository: mesos


Description (updated)
---

This allows authorization of static resource reservations.

These reservations do not arrived to the master through an
`Offer::Operation::Reserve` API call but during agent 
registration.


Diffs
-

  src/master/master.hpp 232cc3758f240db626c4fdaf852163fa48af4dd7 
  src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea 


Diff: https://reviews.apache.org/r/64514/diff/1/


Testing
---

make check.


Thanks,

Jiang Yan Xu

Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-12 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review193557
---


Ship it!




Ship It!

- Jiang Yan Xu


On Dec. 12, 2017, 4:54 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Dec. 12, 2017, 4:54 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea 
>   src/tests/master_allocator_tests.cpp 
> 10de6f0104a6a93ff5037d7e2ab5cf21d57fbec8 
>   src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 
>   src/tests/partition_tests.cpp 54ccf783682e7e6db0847b9a6313489b7f8181f8 
>   src/tests/persistent_volume_tests.cpp 
> f255382af957cfa66f5efaffcaf1082b83b35e58 
>   src/tests/slave_recovery_tests.cpp 253b0fc2ff7ec1f00937d42636151553c46d5175 
>   src/tests/upgrade_tests.cpp 0efaa586153564b20f7884023946a11635425ee4 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/15/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Review Request 64516: Improved documentation on resource reservations.

2017-12-11 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64516/
---

Review request for mesos, Alexander Rojas, James Peach, and Michael Park.


Bugs: MESOS-8306
https://issues.apache.org/jira/browse/MESOS-8306


Repository: mesos


Description
---

Added explanation for authorizations on static reservations.


Diffs
-

  docs/reservation.md fa1f309f2b770e37964af2d5599519875195dbec 


Diff: https://reviews.apache.org/r/64516/diff/1/


Testing
---

Used a markdown viewer.


Thanks,

Jiang Yan Xu

Review Request 64514: Refactor out `authorizeReserveResources` that takes a `Resources`.

2017-12-11 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64514/
---

Review request for mesos, Alexander Rojas and James Peach.


Bugs: MESOS-8306
https://issues.apache.org/jira/browse/MESOS-8306


Repository: mesos


Description
---

This allows us to authorize static resource reservations that don't
come from `Offer::Operation::Reserve`.


Diffs
-

  src/master/master.hpp 232cc3758f240db626c4fdaf852163fa48af4dd7 
  src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea 


Diff: https://reviews.apache.org/r/64514/diff/1/


Testing
---

make check.


Thanks,

Jiang Yan Xu

Review Request 64515: Used `reserve_resources` ACL for static reservations.

2017-12-11 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64515/
---

Review request for mesos, Alexander Rojas and James Peach.


Bugs: MESOS-8306
https://issues.apache.org/jira/browse/MESOS-8306


Repository: mesos


Description
---

Used `reserve_resources` ACL for static reservations.


Diffs
-

  include/mesos/authorizer/acls.proto 40a1425ca51c5bb70f7af2e17d605f2125dcb4cb 
  src/master/master.hpp 232cc3758f240db626c4fdaf852163fa48af4dd7 
  src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea 
  src/tests/master_authorization_tests.cpp 
676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 


Diff: https://reviews.apache.org/r/64515/diff/1/


Testing
---

make check.


Thanks,

Jiang Yan Xu

Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-11 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review193460
---



LGTM sans Ilya's comment about only use `replicated_log` flag where necessary.


src/master/master.cpp
Line 6807 (original), 6803-6804 (patched)
<https://reviews.apache.org/r/64098/#comment272010>

Move `Framework* framework = getFramework(frameworkId);` down into the if 
condition since it's used only in there?



src/tests/partition_tests.cpp
Lines 402-415 (original), 403-407 (patched)
<https://reviews.apache.org/r/64098/#comment272016>

Expectations above `detector.appoint(master.get()->pid);` everywhere as 
suggested in the last review.



src/tests/partition_tests.cpp
Lines 716-730 (original), 710-714 (patched)
<https://reviews.apache.org/r/64098/#comment272017>

Ditto.



src/tests/partition_tests.cpp
Lines 1030-1032 (original), 1015-1025 (patched)
<https://reviews.apache.org/r/64098/#comment272018>

Ditto.



src/tests/partition_tests.cpp
Lines 2297 (patched)
<https://reviews.apache.org/r/64098/#comment272024>

`statusFromAgent` sounds like it's actually sent by the agent. Maybe 
`finsihedStatusFromMaster` for this update and rename the other `finshedStatus` 
to `finsihedStatusFromAgent` makes it a bit easier to distinguish? (I do 
understand that you meant that the status is from the agent's reregister 
message but I think  `finsihedStatusFromMaster` vs `finsihedStatusFromAgent` is 
clearer).

Also, can we add a comment too? e.g., the first status is sent as part of 
the reregistation and the second is by the agent's status update manager after 
it is reregsitered.


- Jiang Yan Xu


On Dec. 8, 2017, 10:42 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Dec. 8, 2017, 10:42 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/12/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64429: Fixed skipping of completed frameworks in the master failover benchmark.

2017-12-07 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64429/#review193192
---


Ship it!




Ship It!

- Jiang Yan Xu


On Dec. 7, 2017, 2:33 p.m., Benjamin Mahler wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64429/
> ---
> 
> (Updated Dec. 7, 2017, 2:33 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This benchmark was previously accidentally skipping the completed
> frameworks, due to an incorrect loop.
> 
> 
> Diffs
> -
> 
>   src/tests/master_benchmarks.cpp 6c3a8e514a452f68a0b420b98f5af0dccb01ab68 
> 
> 
> Diff: https://reviews.apache.org/r/64429/diff/1/
> 
> 
> Testing
> ---
> 
> Ran the benchmark
> 
> 
> Thanks,
> 
> Benjamin Mahler
> 
>

Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-05 Thread Jiang Yan Xu

  
Setting `WillRepeatedly` here is probably racy due to 

```
EXPECT_CALL(sched, statusUpdate(, _))
.WillOnce(FutureArg<1>());
```

below, not to mention the case could be interesting to test.

We can set the expectation explicitly:
    
    The first update will come from the master, the reason being 
"agent_reregistered".

The second update comes from the agent.


- Jiang Yan Xu


On Dec. 1, 2017, 4:29 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Dec. 1, 2017, 4:29 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/11/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-04 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192733
---



Mainly looked at the reason for the test changes. Will take another look at the 
whole patch once it's fixed.


src/tests/master_allocator_tests.cpp
Lines 1452-1454 (original), 1452-1459 (patched)
<https://reviews.apache.org/r/64098/#comment270989>

So these tests had to be changed becauase the master is not using the 
`replicated_log` registry flag so it cannot remember the agent...

I think the right change is to use `replicated_log` for the tests that 
don't make sense to change for this patch.

Note that the use of replicated log registrar breaks on Windows (see 
MESOS-5932) so we need to make sure such tests are disabled on Windows.

The way to do it is to check whether the file is in this list: 
https://github.com/apache/mesos/blob/42952093c081e494d36721bc6c56ab73069a82e4/src/tests/CMakeLists.txt#L152

If it is, then it is already not built on Windows.

If not, we can use `TEST_F_TEMP_DISABLED_ON_WINDOWS` as part of the test 
name to disable it.

Please review all the tests in this RR for this.


- Jiang Yan Xu


On Dec. 1, 2017, 4:29 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Dec. 1, 2017, 4:29 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/8/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.

2017-12-01 Thread Jiang Yan Xu



> On Nov. 21, 2017, 10:27 a.m., Alexander Rukletsov wrote:
> > src/master/allocator/mesos/hierarchical.cpp
> > Lines 430-437 (original), 431-438 (patched)
> > <https://reviews.apache.org/r/63831/diff/1/?file=1894169#file1894169line431>
> >
> >     Do we need a lambda here?
> 
> Jiang Yan Xu wrote:
> I kept the original use of lambdas. 
> 
> I think the benefit of labmdas here is more about readability: compared 
> to 
> 
> ```
> set removedRoles = oldRoles;
> foreach (const string& role, newRoles) {
>   result.erase(role);
> }
> ```
> 
> The constness and construction of the variable is clearly isolated into a 
> small block in a long method with many similar variables/code blocks.
> 
> However as pointed out by the TODO, we should probably just implement a 
> set difference operator so all these become one-liners. I'll do it later.
> 
> Alexander Rukletsov wrote:
> Sounds like a wrong tool for the job and hence confusing. I would have 
> used an explicit scope then. But maybe this is just me. I'll ask MPark for 
> the reasoning here. Since it has been here before, it's fine to leave it for 
> now. Regading introducing the set operator, FYI: 
> https://reviews.apache.org/r/57110/#comment239761

Oh ok, thanks for the pointer!


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63831/#review191613
---


On Nov. 27, 2017, 4:45 p.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63831/
> ---
> 
> (Updated Nov. 27, 2017, 4:45 p.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-8223
> https://issues.apache.org/jira/browse/MESOS-8223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In `updateFramework`, we currently treat frameworks suppressing a role the 
> same way as frameworks moving off a role and this could lead to the framework 
> being removed from the framework sorter, which is unexpected. We should only 
> remove the framework from a framework sorter when it's moving away from the 
> corresponding role.
> 
> 
> Diffs
> -
> 
>   src/master/allocator/mesos/hierarchical.cpp 
> 5ce9ceaa3a5f84a1e076d45448863c418531cc2b 
>   src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 
> 
> 
> Diff: https://reviews.apache.org/r/63831/diff/2/
> 
> 
> Testing
> ---
> 
> make check.
> 
> Added a test.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.

2017-12-01 Thread Jiang Yan Xu



> On Nov. 21, 2017, 9:53 a.m., Alexander Rukletsov wrote:
> > src/tests/scheduler_tests.cpp
> > Lines 1515-1516 (original), 1534-1535 (patched)
> > <https://reviews.apache.org/r/63830/diff/1/?file=1892884#file1892884line1542>
> >
> > `.WillRepeatedly(Return());`
> 
> Jiang Yan Xu wrote:
> There's only one agent with no tasks launched so I don't expect 
> additional offers?
> 
> Alexander Rukletsov wrote:
> I see it in a different way. The default behaviour is ignore subsequent 
> offers and only if you need to verify that in some particular case no more 
> offers will be made, you should omit `.WillRepeatedly(Return());`. In this 
> case what you want to check is that "scheduler starts getting offers" saying 
> nothing about its quantity or periodicity.

Oh ok, but if I am intentionally testing no more offers will be made I'll 
probably need to settle the clock etc to prevent race conditions (and comment 
as such).

Are you suggesting that when writing tests we should by default always add 
`.WillRepeatedly(Return());`? I am not sure if I have seen this being adopted 
and it seems to result in necessary code.

So my approach has been that, if the test is pretty deterministic, write the 
test asserting the precise case. If there are certain unpredicable race 
condidtions (maybe not worth preventing because you don't care), then use these 
`WillRepeatedly // Ignore because don't care` calls.

Reasonable? :)


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63830/#review191602
-----------


On Nov. 21, 2017, 9:46 a.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63830/
> ---
> 
> (Updated Nov. 21, 2017, 9:46 a.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-7996
> https://issues.apache.org/jira/browse/MESOS-7996
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The previous iteration of the test failed to reveal the bug in MESOS-8200 due 
> to a race condition.
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 
> 
> 
> Diff: https://reviews.apache.org/r/63830/diff/1/
> 
> 
> Testing
> ---
> 
> As expected, this test would reliably fail without /r/63741/ with the 
> following
> 
> ```
> I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED 
> received from http://17.121.128.13:38423/master/api/v1/scheduler
> I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT 
> received from http://17.121.128.13:38423/master/api/v1/scheduler
> I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework 
> 560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default)
> I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS 
> received from http://17.121.128.13:38423/master/api/v1/scheduler
> ../../src/tests/scheduler_tests.cpp:1497: Failure
> Mock function called more times than expected - returning directly.
> Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object 
> <50-98 0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>)
>  Expected: to be never called
>Actual: called once - over-saturated and active
> ```
> 
> The test passes with /r/63741/.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 64250: Added new reasons task status update.

2017-12-01 Thread Jiang Yan Xu

> On Dec. 1, 2017, 1:55 p.m., Ilya Pronin wrote:
> > docs/task-state-reasons.md
> > Lines 474-477 (patched)
> > <https://reviews.apache.org/r/64250/diff/2/?file=1906002#file1906002line474>
> >
> > I don't quite follow this note. A modified copy of which update? Should 
> > we just say that these updates reflect the states of the tasks reported by 
> > the agent upon its re-registration?
> 
> Megha Sharma wrote:
> Here, we are saying that master takes the actual state reported by the 
> agent and enriches it with reason and message before sending to the 
> framework. But I am open to changing it if you or Yan feel it doesn't convery 
> the idea.

I was suggesting it because this status is generated by the master the same way 
they are generated upon reconciliation.

For `REASON_RECONCILIATION` in the same doc there is this note:

```
Note: Status updates with this reason are not the original ones, but rather a 
modified copy that is re-sent from the master. In particular, the original data 
and message fields are erased and the original reason field is overwritten by 
REASON_RECONCILIATION .
```

It is the same for `REASON_SLAVE_REREGISTERED`. 

Megha I just noticed that you didn't mention the erasure of `data` and 
`message` fields, but perhaps this is not worth repeating. I was originally 
suggesting refering to `REASON_RECONCILIATION` for details: something like: 
`Status updates with this reason are a modified copies re-sent by the master. 
See comments for REASON_RECONCILIATION.`

Would this be clearer?

"reflect the states of the tasks reported by the agent upon its 
re-registration" this sentence is good too.

- Jiang Yan

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64250/#review192567
---

On Dec. 1, 2017, 1:06 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64250/
> ---
> 
> (Updated Dec. 1, 2017, 1:06 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new reasons `REASON_SLAVE_REREGISTERED` and
> `REASON_AGENT_REREGISTERED` for v0 and v1 task status update.
> The new reason will be used when master starts to send status
> update during the re-registration of an agent which was
> previously removed by the master because of being unreachable
> or is unknown to the master due to the garbage collection of
> the unreachable and gone agents in the registry.
> 
> 
> Diffs
> -
> 
>   docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 
>   include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa 
>   include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 
> 
> 
> Diff: https://reviews.apache.org/r/64250/diff/4/
> 
> 
> Testing
> ---
> 
> code changes verified with make check and the documents changes with markdown 
> viewer
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64250: Added new reasons task status update.

2017-12-01 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64250/#review192563
---


Fix it, then Ship it!





docs/task-state-reasons.md
Lines 476 (patched)
<https://reviews.apache.org/r/64250/#comment270801>

What is the `/written` part about? `overwritten` covers all cases?



docs/task-state-reasons.md
Lines 477 (patched)
<https://reviews.apache.org/r/64250/#comment270805>

Given the discussion in /r/64098/, here perhaps add another `Note: Due to 
garbage collection of the unreachable and gone agents in the registry, Mesos 
also sends such status updates for agents unknown to the master`.



docs/task-state-reasons.md
Lines 478 (patched)
<https://reviews.apache.org/r/64250/#comment270804>

Should there be a ``?

Also when it involves doc changes, could you in the testing done section 
mention that you have used a markdown viewer to review the doc change?


- Jiang Yan Xu


On Dec. 1, 2017, 1:06 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64250/
> ---
> 
> (Updated Dec. 1, 2017, 1:06 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new reasons `REASON_SLAVE_REREGISTERED` and
> `REASON_AGENT_REREGISTERED` for v0 and v1 task status update.
> The new reason will be used when master starts to send status
> update during the re-registeration of an agent which was
> previosuly removed by the master because of being unreachable.
> 
> 
> Diffs
> -
> 
>   docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 
>   include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa 
>   include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 
> 
> 
> Diff: https://reviews.apache.org/r/64250/diff/2/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64098: Send status updates when agent re-registers.

2017-12-01 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192557
---




src/master/master.cpp
Line 6808 (original), 6808 (patched)
<https://reviews.apache.org/r/64098/#comment270793>

When considering the comment by Ilya in MESOS-6406 (i.e., what if agents 
GCed from the unreachable or gone list reregster?), seems like we can do this:

1. Move down the line `slaves.recovered.erase(slaveInfo.id());` to after we 
process `recoveredTasks`.
2. Instead of checking `slaves.unreachable.contains(slaveInfo.id()` we 
could check `!slaves.recovered.contains(slaveInfo.id()`
3. Now we are sending status updates for two cases: reregistering 
unreachable agents or unknown agents (which could have been marked either 
unreachable or gone but we can't distiguish)
- We can distinguish unreachable and unknown in the task status message.
- We can probably log a warning about tasks from unknown agents.



src/master/master.cpp
Lines 6818 (patched)
<https://reviews.apache.org/r/64098/#comment270794>

s/REASON_AGENT_REREGISTERED/REASON_SLAVE_REREGISTERED/


- Jiang Yan Xu


On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Nov. 27, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> which has been previously removed by the master for being unreachable
> re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 
>   src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d 
>   src/tests/persistent_volume_tests.cpp 
> 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 
>   src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/5/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64250: Added new reasons task status update.

2017-12-01 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64250/#review192531
---




docs/task-state-reasons.md
Lines 370-372 (patched)
<https://reviews.apache.org/r/64250/#comment270738>

Should it be in the "Reasons for Non-Terminal Status Updates" section and 
is there a reason to refer to TASK_LOST?



docs/task-state-reasons.md
Lines 479 (patched)
<https://reviews.apache.org/r/64250/#comment270740>

"The task was part of an accepted offer": this is irrevelent to either 
TASK_UNREACHABLE or the reason REASON_SLAVE_REREGISTERED.



docs/task-state-reasons.md
Lines 480-482 (patched)
<https://reviews.apache.org/r/64250/#comment270741>

We don't need to put too much details on TASK_UNREACHABLE since it's 
described elsewhere.



docs/task-state-reasons.md
Lines 483-485 (patched)
<https://reviews.apache.org/r/64250/#comment270744>

I think we can just focus on:

- This is sent when an agent previously marked as unreachable re-registers.
- Status updates with this reason are not the original ones, see the 
comments for `REASON_RECONCILIATION`.



include/mesos/mesos.proto
Lines 2406 (patched)
<https://reviews.apache.org/r/64250/#comment270736>

I think has convention used here is to order the reasons alphabetically.



include/mesos/v1/mesos.proto
Lines 2387 (patched)
<https://reviews.apache.org/r/64250/#comment270737>

Ditto.


- Jiang Yan Xu


On Dec. 1, 2017, 8:20 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64250/
> ---
> 
> (Updated Dec. 1, 2017, 8:20 a.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added new reasons `REASON_SLAVE_REREGISTERED` and
> `REASON_AGENT_REREGISTERED` for v0 and v1 task status update.
> The new reason will be used when master starts to send status
> update during the re-registeration of an agent which was
> previosuly removed by the master because of being unreachable.
> 
> 
> Diffs
> -
> 
>   docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 
>   include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa 
>   include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 
> 
> 
> Diff: https://reviews.apache.org/r/64250/diff/1/
> 
> 
> Testing
> ---
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64098: Send status updates when agent re-registers.

2017-11-30 Thread Jiang Yan Xu



> On Nov. 28, 2017, 11:22 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 6788 (patched)
> > <https://reviews.apache.org/r/64098/diff/3/?file=1902267#file1902267line6788>
> >
> > I think this type of status updates could benefit from a distinct 
> > reason for to make it more strongly typed.
> > 
> > Could you add
> > 
> > ```
> > v1/mesos.proto: REASON_AGENT_REREGISTERED
> > mesos.proto: REASON_SLAVE_REREGISTERED
> > ```
> > 
> > and use it?
> > 
> > Note that the tag numbers in the proto enum need to match.
> > 
> > Also note that there are also documentation related changes for these 
> > fields: 
> > https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md
> > 
> >     It's fine to make it a separate patch preceding this one.
> 
> Ilya Pronin wrote:
> Won't the new reason break any frameworks?
> 
> Jiang Yan Xu wrote:
> How will it break? We add new reasons all the time. This is indeed a new 
> sitaution that Mesos master sends status updates, how do you think the 
> framework will assert the reason otherwise?
> 
> Ilya Pronin wrote:
> Sure, the new reason reflects what's going on much better. I was thinking 
> that a framework that uses the old libmesos will get a "default" status 
> update reason, which is `REASON_COMMAND_EXECUTOR_FAILED`, and may act 
> incorrectly. But it seems that Aurora will be fine with such change.
> 
> Megha Sharma wrote:
> Yeah it seems that the frameworks should be fine with the addition of 
> another reason, I guess that's how we have been adding new reasons all along. 
> The proto default behavior to interpret the unknown enum value as first one 
> defined, is of course a problem but I believe the frameworks must already 
> have some guards in place to avoid that.
> 
> Megha Sharma wrote:
> +1 to add the new reasons:
> 
> v1/mesos.proto: REASON_AGENT_REREGISTERED
> mesos.proto: REASON_SLAVE_REREGISTERED

I see. Yeah the fact that there's not a sensible default value for the reason 
field is terrible. 

At least `REASON_COMMAND_EXECUTOR_FAILED` is deprecated: 
https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md

```
The reason `REASON_COMMAND_EXECUTOR_FAILED` is deprecated and will be removed
in the future. It should not be referenced by newly written code.
```

Hopefully we get rid of it soon and replace it with a `REASON_UNKONWN_REASON`.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192041
---


On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Nov. 27, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
>   src/tests/persistent_volume_tests.cpp 
> acfeac16884b00581a3523607ff26f44f6dca53a 
>   src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/4/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-29 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review192185
---


Ship it!




Committing with some small tweaks below and for the commit description. Please 
resolve all issues from other reviewers with comments.


include/mesos/mesos.proto
Line 345 (original), 343 (patched)
<https://reviews.apache.org/r/61473/#comment270233>

Use a `NOTE: `.



include/mesos/v1/mesos.proto
Line 343 (original), 341 (patched)
<https://reviews.apache.org/r/61473/#comment270234>

Use a `NOTE: `.



src/master/http.cpp
Lines 336-338 (patched)
<https://reviews.apache.org/r/61473/#comment270235>

Put it above `if (!authorizeTask_->accept(*task, framework_->info)) {` line 
so the order is consistent with the block below.

Also add a small comment about this check: 

```
// There could be TASK_LOST tasks in this map. See comment for
// `unreachableTasks`.
```



src/master/master.cpp
Lines 9322-9323 (patched)
<https://reviews.apache.org/r/61473/#comment270231>

Tweak comment so it's less jagged.



src/master/master.cpp
Lines 9614-9616 (original), 9565-9568 (patched)
<https://reviews.apache.org/r/61473/#comment270232>

Reorder the comments and the CHECK.


- Jiang Yan Xu


On Nov. 28, 2017, 4:59 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Nov. 28, 2017, 4:59 p.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including
> the ones that are still registered which was problematic because
> the offer from this agent could still go to the same framework which
> could then launch new tasks. The agent would then receive tasks
> of the same framework and ignore them because it thinks the
> framework is shutting down. The framework is not shutting down of
> course, so from the master and the scheduler's perspective the task
> is pending in STAGING forever until the next agent reregistration,
> which could happen much later. This commit fixes the problem by
> not shutting down the non-partition aware frameworks on such an
> agent.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto b1ebfe25301549397a48468a02882e971213d45c 
>   include/mesos/v1/mesos.proto d535eb40b205fc176730937eed4ce84ea7a369af 
>   src/master/http.cpp 9dcdcbeeea6135091db5aa21dd54bc14d84f33fc 
>   src/master/master.hpp 1c6a86fb37dee7a2ff4d564f4641a42af6206bb2 
>   src/master/master.cpp 7bcdb743659435847db6cdea917afc497e641582 
>   src/tests/partition_tests.cpp 067529acc2b3a1d7f0713c602d5f680ea19b6de8 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/29/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-28 Thread Jiang Yan Xu



> On Nov. 27, 2017, 4:25 p.m., Ilya Pronin wrote:
> > src/master/http.cpp
> > Lines 343-345 (patched)
> > <https://reviews.apache.org/r/61473/diff/20/?file=1902099#file1902099line343>
> >
> > I may be ignorant of the discussion behind this, but since we don't 
> > treat these tasks as completed anymore, do we really need to maintain 
> > backwards compatibility here, in stats and metrics? This kind of hides the 
> > real state of things.
> 
> Megha Sharma wrote:
> Copying @xujyan to the discussion.

It's true that this isn't perfect and my first intuition was to export the real 
state. We chatted with Vinod as well. Ultimately we think it's worse to export 
information inconsistently. More than metrics there are v0 http endpoints, the 
webUI and v1 operator APIs that expose the task's state (in addition to the 
scheduler API). It would be weird for the task's state to show up differently 
via different APIs.

Secondly, the mechanism implemented by Mesos to handle the backwards 
compatibility w.r.t TASK_LOST vs. new states is to decide which to use when the 
task status is **created**, unlike how we handle old/new reservation formats 
and framework API devolve/evolve, for example. This means we lose the original 
state in upon creation and have no way to export it later. We had to use a 
`bool unreachable` to *remember* the original state for unreachable tasks but 
this cannot be generalized to all "new states" and I don't think it's worth 
treating unreachable tasks specically.

Does it make sense?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review191674
---


On Nov. 28, 2017, 9:28 a.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Nov. 28, 2017, 9:28 a.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including
> the ones that are still registered which was problematic because
> the offer from this agent could still go to the same framework which
> could then launch new tasks. The agent would then receive tasks
> of the same framework and ignore them because it thinks the
> framework is shutting down. The framework is not shutting down of
> course, so from the master and the scheduler's perspective the task
> is pending in STAGING forever until the next agent reregistration,
> which could happen much later. This commit fixes the problem by
> not shutting down the non-partition aware frameworks on such an
> agent.
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 
>   include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e 
>   src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 
>   src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
> 
> 
> Diff: https://reviews.apache.org/r/61473/diff/25/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64098: Send status updates when agent re-registers.

2017-11-28 Thread Jiang Yan Xu



> On Nov. 28, 2017, 11:22 a.m., Jiang Yan Xu wrote:
> > src/master/master.cpp
> > Lines 6788 (patched)
> > <https://reviews.apache.org/r/64098/diff/3/?file=1902267#file1902267line6788>
> >
> > I think this type of status updates could benefit from a distinct 
> > reason for to make it more strongly typed.
> > 
> > Could you add
> > 
> > ```
> > v1/mesos.proto: REASON_AGENT_REREGISTERED
> > mesos.proto: REASON_SLAVE_REREGISTERED
> > ```
> > 
> > and use it?
> > 
> > Note that the tag numbers in the proto enum need to match.
> > 
> > Also note that there are also documentation related changes for these 
> > fields: 
> > https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md
> > 
> > It's fine to make it a separate patch preceding this one.
> 
> Ilya Pronin wrote:
> Won't the new reason break any frameworks?

How will it break? We add new reasons all the time. This is indeed a new 
sitaution that Mesos master sends status updates, how do you think the 
framework will assert the reason otherwise?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192041
---


On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> -------
> 
> (Updated Nov. 27, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
>   src/tests/persistent_volume_tests.cpp 
> acfeac16884b00581a3523607ff26f44f6dca53a 
>   src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/4/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 64098: Send status updates when agent re-registers.

2017-11-28 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/64098/#review192041
---



Reviewed the non-testing portion. Will continue to review the tests.


src/master/master.cpp
Lines 6783 (patched)
<https://reviews.apache.org/r/64098/#comment270036>

4 space indentation for function arguments.



src/master/master.cpp
Lines 6788 (patched)
<https://reviews.apache.org/r/64098/#comment270046>

I think this type of status updates could benefit from a distinct reason 
for to make it more strongly typed.

Could you add

```
v1/mesos.proto: REASON_AGENT_REREGISTERED
mesos.proto: REASON_SLAVE_REREGISTERED
```

and use it?

Note that the tag numbers in the proto enum need to match.

Also note that there are also documentation related changes for these 
fields: https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md

It's fine to make it a separate patch preceding this one.



src/master/master.cpp
Lines 6789 (patched)
<https://reviews.apache.org/r/64098/#comment270044>

s/Agent/agent/



src/master/master.cpp
Lines 6791-6792 (patched)
<https://reviews.apache.org/r/64098/#comment270043>

I know that the styling of the ternary operator is terribly inconsistent 
but could you do this which I think it more widely used?

```
(task.has_executor_id()
? Option(task.executor_id()) : None()),
```

There will be a day when we just use clang tidy to fix everything. :)



src/master/master.cpp
Lines 6793-6797 (patched)
<https://reviews.apache.org/r/64098/#comment270041>

Should we actually assign these values?

```
  protobuf::getTaskHealth(*task),
  protobuf::getTaskCheckStatus(*task),
  None(),
  protobuf::getTaskContainerStatus(*task)
```

Note that the `None`s are default values so you don't have to specify 
trailing Nones.


- Jiang Yan Xu


On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/64098/
> ---
> 
> (Updated Nov. 27, 2017, 4:55 p.m.)
> 
> 
> Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-6406
> https://issues.apache.org/jira/browse/MESOS-6406
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will send task status updates to frameworks when an agent
> re-registers.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f 
>   src/tests/master_allocator_tests.cpp 
> 3400d70bb0ba564eac43c4639eee0efd4d8059e6 
>   src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b 
>   src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 
>   src/tests/persistent_volume_tests.cpp 
> acfeac16884b00581a3523607ff26f44f6dca53a 
>   src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a 
>   src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c 
> 
> 
> Diff: https://reviews.apache.org/r/64098/diff/3/
> 
> 
> Testing
> ---
> 
> with make check
> 
> 
> Thanks,
> 
> Megha Sharma
> 
>

Re: Review Request 61473: Do not kill non partition aware tasks.

2017-11-28 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/61473/#review191976
---

Almost there!

include/mesos/mesos.proto
Lines 344-346 (original), 342 (patched)
<https://reviews.apache.org/r/61473/#comment269950>

Given the potential confusion about the chagned behavior, at the API level 
I think there's value in explaining the history a little bit:

```
Prior to Mesos 1.5, such tasks will be killed by Mesos when the agent 
reregisters (unless the master has failed over). However due to the lack of 
benefit in maintaining different behaviors depending on whether the master has 
failed over (see MESOS-7215), as of 1.5, Mesos will not kill these tasks in 
either case.
```

How does it soud?

include/mesos/v1/mesos.proto
Lines 342-344 (original), 340 (patched)
<https://reviews.apache.org/r/61473/#comment269957>

Ditto.

src/master/http.cpp
Line 322 (original), 323 (patched)
<https://reviews.apache.org/r/61473/#comment269983>

Do we need a 

```
if (task.get()->state() != TASK_UNREACHABLE) {
  continue;
}
```

here?

src/master/master.hpp
Line 873 (original), 873 (patched)
<https://reviews.apache.org/r/61473/#comment269994>

We need the comment about `unreachable` here as well?

src/master/master.hpp
Line 2644 (original), 2645 (patched)
<https://reviews.apache.org/r/61473/#comment269995>

Are there two cases here or `unreachable` is the only variable that matters?

We can assert the following invariant inside an `if (unreachable) {}`:

```
CHECK(task->state() == TASK_UNREACHABLE || task->state() == TASK_LOST) << 
task->state();
```

src/master/master.hpp
Line 3160 (original), 3157 (patched)
<https://reviews.apache.org/r/61473/#comment269996>

s/all tasks/tasks runing/

src/master/master.cpp
Lines 6804-6807 (original), 6779-6782 (patched)
<https://reviews.apache.org/r/61473/#comment269997>

Fix indentation.

src/master/master.cpp
Line 6849 (original), 6820 (patched)
<https://reviews.apache.org/r/61473/#comment269998>

s/Determine which frameworks on the slave to shutdown, if any.//

Previously when there are two cases, this sentence is a summary. Now that 
there is only one case, this sentence feels redudant.

src/master/master.cpp
Line 9370 (original), 9322 (patched)
<https://reviews.apache.org/r/61473/#comment27>

I have the similar feeling as Ilya, without context it's hard to understand 
this line. Although the context is not too far above, why not just define it 
within the context?

```
TaskState newTaskState = TASK_UNREACHABLE;
TaskStatus::Reason newTaskReason = TaskStatus::REASON_SLAVE_REMOVED;

// Needed to convey task unreachablility because we lose this 
information
// from the task state if `TASK_LOST` is used.
bool unreachable = true;

if (!framework->capabilities.partitionAware) {
  newTaskState = TASK_LOST;
} else if (unreachableTime.isNone()) {
  unreachable = false;
  newTaskState = TASK_GONE_BY_OPERATOR;
  newTaskReason = TaskStatus::REASON_SLAVE_REMOVED_BY_OPERATOR;
}
```

Note that I simplied the nested if else structure. Does it look right?

src/master/master.cpp
Line 9582 (original), 9535 (patched)
<https://reviews.apache.org/r/61473/#comment270001>

Assert this invariant inside this `if`?

```
CHECK(!unreachable) << task->task_id();
```

src/master/master.cpp
Line 10318 (original), 10271 (patched)
<https://reviews.apache.org/r/61473/#comment270002>

Use `foreachvalue`. `values()` call will copy out the values.

- Jiang Yan Xu

On Nov. 27, 2017, 4:56 p.m., Megha Sharma wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/61473/
> ---
> 
> (Updated Nov. 27, 2017, 4:56 p.m.)
> 
> 
> Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-7215
> https://issues.apache.org/jira/browse/MESOS-7215
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Master will not kill the tasks for non-Partition aware frameworks
> when an unreachable agent re-registers with the master.
> Master used to send a ShutdownFrameworkMessages to the agent
> to kill the tasks from non partition aware frameworks including
> the ones that are still registered which was problematic because
> the offer from this agent could still go to the same framework which
> could then launch new t

Re: Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.

2017-11-27 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63831/
---

(Updated Nov. 27, 2017, 4:45 p.m.)


Review request for mesos and Alexander Rukletsov.


Changes
---

Addressed comments.


Bugs: MESOS-8223
https://issues.apache.org/jira/browse/MESOS-8223


Repository: mesos


Description
---

In `updateFramework`, we currently treat frameworks suppressing a role the same 
way as frameworks moving off a role and this could lead to the framework being 
removed from the framework sorter, which is unexpected. We should only remove 
the framework from a framework sorter when it's moving away from the 
corresponding role.


Diffs (updated)
-

  src/master/allocator/mesos/hierarchical.cpp 
5ce9ceaa3a5f84a1e076d45448863c418531cc2b 
  src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 


Diff: https://reviews.apache.org/r/63831/diff/2/

Changes: https://reviews.apache.org/r/63831/diff/1-2/


Testing
---

make check.

Added a test.


Thanks,

Jiang Yan Xu

Re: Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.

2017-11-27 Thread Jiang Yan Xu



> On Nov. 21, 2017, 10:27 a.m., Alexander Rukletsov wrote:
> > src/master/allocator/mesos/hierarchical.cpp
> > Lines 430-437 (original), 431-438 (patched)
> > <https://reviews.apache.org/r/63831/diff/1/?file=1894169#file1894169line431>
> >
> > Do we need a lambda here?

I kept the original use of lambdas. 

I think the benefit of labmdas here is more about readability: compared to 

```
set removedRoles = oldRoles;
foreach (const string& role, newRoles) {
  result.erase(role);
}
```

The constness and construction of the variable is clearly isolated into a small 
block in a long method with many similar variables/code blocks.

However as pointed out by the TODO, we should probably just implement a set 
difference operator so all these become one-liners. I'll do it later.


> On Nov. 21, 2017, 10:27 a.m., Alexander Rukletsov wrote:
> > src/tests/scheduler_tests.cpp
> > Lines 1554-1556 (patched)
> > <https://reviews.apache.org/r/63831/diff/1/?file=1894170#file1894170line1554>
> >
> > How about something like this:
> > // This test verifies that if a framework decides to suppress its roles 
> > on reregisteration, no offers will be made.

The main difference between `NoOffersOnReregistrationWithAllRolesSuppressed` 
and `NoOffersWithAllRolesSuppressed` is the initial condition "a framework 
initially with no roles suppressing offers". Although your suggestion implies 
it, I was emphasizing it. How about:

```
This test verifies that if a framework (initially with no roles suppressed) 
decides to suppress offers for its roles on reregisteration, no offers will be 
made.
```

It's probably simpler to read than my original version.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63831/#review191613
---


On Nov. 15, 2017, 11:05 p.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63831/
> ---
> 
> (Updated Nov. 15, 2017, 11:05 p.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-8223
> https://issues.apache.org/jira/browse/MESOS-8223
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In `updateFramework`, we currently treat frameworks suppressing a role the 
> same way as frameworks moving off a role and this could lead to the framework 
> being removed from the framework sorter, which is unexpected. We should only 
> remove the framework from a framework sorter when it's moving away from the 
> corresponding role.
> 
> 
> Diffs
> -
> 
>   src/master/allocator/mesos/hierarchical.cpp 
> f0fa7754e5968288edb15929efc9d244b177 
>   src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 
> 
> 
> Diff: https://reviews.apache.org/r/63831/diff/1/
> 
> 
> Testing
> ---
> 
> make check.
> 
> Added a test.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 63741: Fixed a bug in devolving framework subscription with suppressed roles.

2017-11-27 Thread Jiang Yan Xu



> On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote:
> > src/internal/devolve.cpp
> > Lines 194-196 (original), 194-196 (patched)
> > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line194>
> >
> >     What about `evolve()`?
> 
> Jiang Yan Xu wrote:
> I suppose I need to handle it there too since `v1::scheduler::Call 
> evolve(const scheduler::Call& call)` is defined there but I have no idea 
> where this is ever used/useful?

I added the custom converstion to `v1::scheduler::Call evolve(const 
scheduler::Call& call)` with a TODO.


> On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote:
> > src/internal/devolve.cpp
> > Lines 199-200 (patched)
> > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line199>
> >
> > Maybe a simple `if(){}` for now? Maybe the next special case will not 
> > be type dependent?

SGTM.


> On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote:
> > src/internal/devolve.cpp
> > Lines 203-204 (patched)
> > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line203>
> >
> > I believe we prefer writing `CopyFrom()` explicitly, no?

I don't think it's estalished as a rule even though we do use CopyFrom most 
often previously. Now that we have upgraded to protobuf 3.5 with move 
semantics, I think we can probably standarize on prefering the assignment 
operator as the choice between copy and move can be determined simply by the 
rvalue-ness of the operand unless copying is explicitly intended. I understand 
that this particular case wouldn't be moved but styling wise it's more 
futre-proof. We should further this discussion. For now I think this is OK as 
this style is already used in the codebase.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63741/#review191605
---


On Nov. 15, 2017, 10:41 a.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63741/
> ---
> 
> (Updated Nov. 15, 2017, 10:41 a.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-8200
> https://issues.apache.org/jira/browse/MESOS-8200
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Also improved logging to show the suppressed roles.
> 
> 
> Diffs
> -
> 
>   src/internal/devolve.cpp 3a02490d734ef4a558b448871f9a1054d832f457 
>   src/master/master.cpp 81192d5ce7c095495a5faa534d4792601e540e0e 
> 
> 
> Diff: https://reviews.apache.org/r/63741/diff/1/
> 
> 
> Testing
> ---
> 
> Tested together with /r/63830/.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 63741: Fixed a bug in devolving framework subscription with suppressed roles.

2017-11-27 Thread Jiang Yan Xu



> On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote:
> > src/internal/devolve.cpp
> > Lines 194-196 (original), 194-196 (patched)
> > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line194>
> >
> > What about `evolve()`?

I suppose I need to handle it there too since `v1::scheduler::Call evolve(const 
scheduler::Call& call)` is defined there but I have no idea where this is ever 
used/useful?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63741/#review191605
-------


On Nov. 15, 2017, 10:41 a.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63741/
> ---
> 
> (Updated Nov. 15, 2017, 10:41 a.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-8200
> https://issues.apache.org/jira/browse/MESOS-8200
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Also improved logging to show the suppressed roles.
> 
> 
> Diffs
> -
> 
>   src/internal/devolve.cpp 3a02490d734ef4a558b448871f9a1054d832f457 
>   src/master/master.cpp 81192d5ce7c095495a5faa534d4792601e540e0e 
> 
> 
> Diff: https://reviews.apache.org/r/63741/diff/1/
> 
> 
> Testing
> ---
> 
> Tested together with /r/63830/.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.

2017-11-27 Thread Jiang Yan Xu



> On Nov. 21, 2017, 9:53 a.m., Alexander Rukletsov wrote:
> > src/tests/scheduler_tests.cpp
> > Lines 1515-1516 (original), 1534-1535 (patched)
> > <https://reviews.apache.org/r/63830/diff/1/?file=1892884#file1892884line1542>
> >
> > `.WillRepeatedly(Return());`

There's only one agent with no tasks launched so I don't expect additional 
offers?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63830/#review191602
---


On Nov. 21, 2017, 9:46 a.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63830/
> ---
> 
> (Updated Nov. 21, 2017, 9:46 a.m.)
> 
> 
> Review request for mesos and Alexander Rukletsov.
> 
> 
> Bugs: MESOS-7996
> https://issues.apache.org/jira/browse/MESOS-7996
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> The previous iteration of the test failed to reveal the bug in MESOS-8200 due 
> to a race condition.
> 
> 
> Diffs
> -
> 
>   src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 
> 
> 
> Diff: https://reviews.apache.org/r/63830/diff/1/
> 
> 
> Testing
> ---
> 
> As expected, this test would reliably fail without /r/63741/ with the 
> following
> 
> ```
> I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED 
> received from http://17.121.128.13:38423/master/api/v1/scheduler
> I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT 
> received from http://17.121.128.13:38423/master/api/v1/scheduler
> I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework 
> 560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default)
> I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS 
> received from http://17.121.128.13:38423/master/api/v1/scheduler
> ../../src/tests/scheduler_tests.cpp:1497: Failure
> Mock function called more times than expected - returning directly.
> Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object 
> <50-98 0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
> 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>)
>  Expected: to be never called
>Actual: called once - over-saturated and active
> ```
> 
> The test passes with /r/63741/.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.

2017-11-21 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63830/
---

(Updated Nov. 21, 2017, 5:46 p.m.)


Review request for mesos and Alexander Rukletsov.


Bugs: MESOS-7996
https://issues.apache.org/jira/browse/MESOS-7996


Repository: mesos


Description
---

The previous iteration of the test failed to reveal the bug in MESOS-8200 due 
to a race condition.


Diffs
-

  src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 


Diff: https://reviews.apache.org/r/63830/diff/1/


Testing
---

As expected, this test would reliably fail without /r/63741/ with the following

```
I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED 
received from http://17.121.128.13:38423/master/api/v1/scheduler
I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT 
received from http://17.121.128.13:38423/master/api/v1/scheduler
I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework 
560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default)
I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS 
received from http://17.121.128.13:38423/master/api/v1/scheduler
../../src/tests/scheduler_tests.cpp:1497: Failure
Mock function called more times than expected - returning directly.
Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object <50-98 
0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>)
 Expected: to be never called
   Actual: called once - over-saturated and active
```

The test passes with /r/63741/.


Thanks,

Jiang Yan Xu

Re: Review Request 63175: Do not generate UnavailableResources for inactive frameworks.

2017-11-17 Thread Jiang Yan Xu



> On Nov. 7, 2017, 5:09 p.m., Benjamin Mahler wrote:
> > src/master/allocator/mesos/hierarchical.hpp
> > Lines 319 (patched)
> > <https://reviews.apache.org/r/63175/diff/1/?file=1864532#file1864532line319>
> >
> > I think this should be `connected`?
> > 
> > Right now there's some confusion around `active` and `connected`. 
> > `active` in the past meant that the framework was receiving offers, but we 
> > never added the ability for frameworks to change this state. When we added 
> > suppression, that was equivalent to deactivation (note that activation has 
> > become a per-role thing).
> > 
> > Mostly `connected` == `active`. But I think the point of this boolean 
> > is to track whether we can talk to the framework? Thoughts?
> 
> Jiang Yan Xu wrote:
> `active` is already used in the class and this review doesn't change its 
> meaning so I think we can defer this for later?
> 
> For activeness vs. connectedness, I feel it's a larger effort to address 
> it consistently. I'll follow up on this.

And we'd better keep it consistent between master and allocator, 
[agents](https://github.com/apache/mesos/blob/94ef10c2d4572cbe8c70b40fe3c59524186d6236/src/master/master.hpp#L196-L202)
 and frameworks, method names 
([e.g.](https://github.com/apache/mesos/blob/94ef10c2d4572cbe8c70b40fe3c59524186d6236/src/master/allocator/mesos/hierarchical.hpp#L123))
 and variables Right?


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63175/#review190412
---


On Nov. 7, 2017, 5:02 p.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63175/
> ---
> 
> (Updated Nov. 7, 2017, 5:02 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Joseph Wu.
> 
> 
> Bugs: MESOS-8085
> https://issues.apache.org/jira/browse/MESOS-8085
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Do not generate UnavailableResources for inactive frameworks.
> 
> 
> Diffs
> -
> 
>   src/master/allocator/mesos/hierarchical.hpp 
> 2f7755afd056455ce6bddfbe2619a359f41d8e6a 
>   src/master/allocator/mesos/hierarchical.cpp 
> f0fa7754e5968288edb15929efc9d244b177 
> 
> 
> Diff: https://reviews.apache.org/r/63175/diff/2/
> 
> 
> Testing
> ---
> 
> make check.
> 
> I didn' write a new test as the externally observable behavior doesn't change.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Re: Review Request 63175: Do not generate UnavailableResources for inactive frameworks.

2017-11-17 Thread Jiang Yan Xu



> On Nov. 7, 2017, 5:09 p.m., Benjamin Mahler wrote:
> > src/master/allocator/mesos/hierarchical.hpp
> > Lines 319 (patched)
> > <https://reviews.apache.org/r/63175/diff/1/?file=1864532#file1864532line319>
> >
> > I think this should be `connected`?
> > 
> > Right now there's some confusion around `active` and `connected`. 
> > `active` in the past meant that the framework was receiving offers, but we 
> > never added the ability for frameworks to change this state. When we added 
> > suppression, that was equivalent to deactivation (note that activation has 
> > become a per-role thing).
> > 
> > Mostly `connected` == `active`. But I think the point of this boolean 
> > is to track whether we can talk to the framework? Thoughts?

`active` is already used in the class and this review doesn't change its 
meaning so I think we can defer this for later?

For activeness vs. connectedness, I feel it's a larger effort to address it 
consistently. I'll follow up on this.


- Jiang Yan


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63175/#review190412
-----------


On Nov. 7, 2017, 5:02 p.m., Jiang Yan Xu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63175/
> ---
> 
> (Updated Nov. 7, 2017, 5:02 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, James Peach, and Joseph Wu.
> 
> 
> Bugs: MESOS-8085
> https://issues.apache.org/jira/browse/MESOS-8085
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Do not generate UnavailableResources for inactive frameworks.
> 
> 
> Diffs
> -
> 
>   src/master/allocator/mesos/hierarchical.hpp 
> c2346054b2c98516f15ab8ce2dc798224ff4def4 
>   src/master/allocator/mesos/hierarchical.cpp 
> 5b6efe5faa3c3b10f1f714f582a155b368f8ccaf 
> 
> 
> Diff: https://reviews.apache.org/r/63175/diff/1/
> 
> 
> Testing
> ---
> 
> make check.
> 
> I didn' write a new test as the externally observable behavior doesn't change.
> 
> 
> Thanks,
> 
> Jiang Yan Xu
> 
>

Review Request 63900: Downgraded the logging level of socket shutdown failures.

2017-11-16 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63900/
---

Review request for mesos, Joseph Wu and Zhitao Li.


Repository: mesos


Description
---

It often merely shows that a connection is closed before libprocess
shuts it down so it should be logged at INFO level.


Diffs
-

  3rdparty/libprocess/src/process.cpp 732aff86bb6f4c56d4eec7107b2cc2aff139c06a 


Diff: https://reviews.apache.org/r/63900/diff/1/


Testing
---

make check.


Thanks,

Jiang Yan Xu

Review Request 63895: Fixed flaky test UnreachableAgentReregisterAfterFailover test.

2017-11-16 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63895/
---

Review request for mesos, Alexander Rukletsov and James Peach.


Bugs: MESOS-8178
https://issues.apache.org/jira/browse/MESOS-8178


Repository: mesos


Description
---

Other tests in this file could probably use the same treatment which I can do 
later.


Diffs
-

  src/tests/slave_tests.cpp a75bb260df223b5b86f31e91eec1b6ba8db00cb2 


Diff: https://reviews.apache.org/r/63895/diff/1/


Testing
---

make check.

I couldn't repro the failure locally but I think the fix makes sense logically 
so it probably is the fix. ;)


Thanks,

Jiang Yan Xu

Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.

2017-11-15 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63831/
---

Review request for mesos and Alexander Rukletsov.


Bugs: MESOS-8223
https://issues.apache.org/jira/browse/MESOS-8223


Repository: mesos


Description
---

In `updateFramework`, we currently treat frameworks suppressing a role the same 
way as frameworks moving off a role and this could lead to the framework being 
removed from the framework sorter, which is unexpected. We should only remove 
the framework from a framework sorter when it's moving away from the 
corresponding role.


Diffs
-

  src/master/allocator/mesos/hierarchical.cpp 
f0fa7754e5968288edb15929efc9d244b177 
  src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 


Diff: https://reviews.apache.org/r/63831/diff/1/


Testing
---

make check.

Added a test.


Thanks,

Jiang Yan Xu

Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.

2017-11-15 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63830/
---

Review request for mesos and Alexander Rukletsov.


Repository: mesos


Description
---

The previous iteration of the test failed to reveal the bug in MESOS-8200 due 
to a race condition.


Diffs
-

  src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 


Diff: https://reviews.apache.org/r/63830/diff/1/


Testing
---

As expected, this test would reliably fail without /r/63741/ with the following

```
I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED 
received from http://17.121.128.13:38423/master/api/v1/scheduler
I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT 
received from http://17.121.128.13:38423/master/api/v1/scheduler
I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework 
560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default)
I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS 
received from http://17.121.128.13:38423/master/api/v1/scheduler
../../src/tests/scheduler_tests.cpp:1497: Failure
Mock function called more times than expected - returning directly.
Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object <50-98 
0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 
00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>)
 Expected: to be never called
   Actual: called once - over-saturated and active
```

The test passes with /r/63741/.


Thanks,

Jiang Yan Xu

Review Request 63741: Fixed framework subscription with suppressed roles.

2017-11-15 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63741/
---

Review request for mesos and Alexander Rukletsov.


Repository: mesos


Description
---

Also improved logging to show the suppressed roles.


Diffs
-

  src/internal/devolve.cpp 3a02490d734ef4a558b448871f9a1054d832f457 
  src/master/master.cpp 81192d5ce7c095495a5faa534d4792601e540e0e 


Diff: https://reviews.apache.org/r/63741/diff/1/


Testing
---

Tested together with /r/63830/.


Thanks,

Jiang Yan Xu

Re: Review Request 63642: Added a test for ExecutorID validation in ReregisterSlaveMessage.

2017-11-07 Thread Jiang Yan Xu


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/63642/#review190371
---




src/tests/master_validation_tests.cpp
Lines 4215 (patched)
<https://reviews.apache.org/r/63642/#comment267704>

2 space indentation.



src/tests/master_validation_tests.cpp
Lines 4225-4226 (patched)
<https://reviews.apache.org/r/63642/#comment267705>

Nit: I feel the grouping looks clearer to have:

```
frameworks.push_back(DEFAULT_FRAMEWORK_INFO);
frameworks[0].set_name("framework1");
frameworks[0].mutable_id()->set_value("framework1");

frameworks.push_back(DEFAULT_FRAMEWORK_INFO);
frameworks[1].set_name("framework2");
frameworks[1].mutable_id()->set_value("framework2");
```

BTW you could use `back()` as well.

Same for the executors. 

Up to you.



src/tests/master_validation_tests.cpp
Lines 4241-4242 (patched)
<https://reviews.apache.org/r/63642/#comment267726>

These are `ASSERT_*`s? Same below.



src/tests/master_validation_tests.cpp
Lines 4243 (patched)
<https://reviews.apache.org/r/63642/#comment267727>

These are `EXPECT_*`? Same below.


- Jiang Yan Xu


On Nov. 7, 2017, 10:47 a.m., James Peach wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/63642/
> ---
> 
> (Updated Nov. 7, 2017, 10:47 a.m.)
> 
> 
> Review request for mesos, James DeFelice and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-8169
> https://issues.apache.org/jira/browse/MESOS-8169
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Added a test to ensure that the ReregisterSlaveMessage validation
> correctly allows duplicate ExecutorIDs as long as they are scoped
> to different frameworks.
> 
> 
> Diffs
> -
> 
>   src/tests/master_validation_tests.cpp 
> be8c8920fdd6eb72fc77247844a731070db5fa1c 
> 
> 
> Diff: https://reviews.apache.org/r/63642/diff/3/
> 
> 
> Testing
> ---
> 
> make check (Fedora 26) with PR #248 applied.
> 
> 
> Thanks,
> 
> James Peach
> 
>

1 2 3 4 5 6 7 8 9 10 >

1 - 100 of 1012 matches

Mail list logo