Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71742/#review218775 --- As we are discussing on slack how to deal with the broken test and how to change it to be more meaningful I think it won't be unreasonable to just delete it for now. We can always add it in its new form back. - Jiang Yan Xu On Nov. 22, 2019, noon, Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71742/ > --- > > (Updated Nov. 22, 2019, noon) > > > Review request for mesos, Benjamin Mahler and Jiang Yan Xu. > > > Bugs: MESOS-10032 > https://issues.apache.org/jira/browse/MESOS-10032 > > > Repository: mesos > > > Description > --- > > In the case agents lost ZooKeeper connections and resetting its > master to none and beginning to dropping control messages from the > master, agent should not respond pings from master. > > > Diffs > - > > src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 > src/tests/slave_recovery_tests.cpp 0efd3a6ac09ad06d9365b7bb2295157b5175e6b8 > > > Diff: https://reviews.apache.org/r/71742/diff/4/ > > > Testing > --- > > make check > > > Thanks, > > Xudong Ni > >
Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71742/#review218653 --- Chatted on https://mesos.slack.com/archives/C8PDVCDE3/p1573688798006800. The approach LGTM. Let's fix the tests. :) src/slave/slave.cpp Lines 6415 (patched) <https://reviews.apache.org/r/71742/#comment306509> s/respond/respond to/ But the comment is basically reiterating the behavior which is pretty straightforward. We can add a bit of rationale for the behavior here: ``` // Don't respond to pings if the agent cannot detect the master (e.g., due to failing to connect to ZK) because it stops responding to control messages such as run/kill tasks. It's better to have the master eventually mark the agent as partitioned after prolonged ZK disconnection than to silently drop messages. ``` src/slave/slave.cpp Lines 6419 (patched) <https://reviews.apache.org/r/71742/#comment306510> "registered master" is not a valid concept here, let's say "because the master cannot be detected"? - Jiang Yan Xu On Nov. 13, 2019, 2:39 p.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71742/ > --- > > (Updated Nov. 13, 2019, 2:39 p.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: MESOS-10032 > https://issues.apache.org/jira/browse/MESOS-10032 > > > Repository: mesos > > > Description > --- > > In the case agents lost ZooKeeper connections and resetting its > master to none and beginning to dropping control messages from the > master, agent should not respond pings from master. > > > Diffs > - > > src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 > > > Diff: https://reviews.apache.org/r/71742/diff/2/ > > > Testing > --- > > ==] 2322 tests from 222 test cases ran. (1038166 ms total) > [ PASSED ] 2321 tests. > [ FAILED ] 1 test, listed below: > [ FAILED ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery, where TypeParam = > mesos::internal::slave::MesosContainerizer > > This failed test verifies that the agent responds to pings from the master > while the agent is performing recovery, this PR will break this scenario. > > > Thanks, > > Xudong Ni > >
Re: Review Request 71193: Supported multiple quota paths in the `disk/xfs` isolator.
> On Aug. 2, 2019, 3:17 p.m., Jiang Yan Xu wrote: > > src/slave/containerizer/mesos/isolators/xfs/disk.cpp > > Line 570 (original), 566 (patched) > > <https://reviews.apache.org/r/71193/diff/3/?file=2158799#file2158799line570> > > > > You are not changing the logic here but could you remind me why this > > error doesn't fail the update? > > James Peach wrote: > I think that we are just trying to be robust against host problems. This > would only fail if there was a serious disk error, so we just try to maintain > our internal consistency. If it's btween continuing without being able to enforce quota and failing task + inititating clean up, I'd say the latter is better protecting consistency? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71193/#review217059 --- On Aug. 5, 2019, 2:17 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71193/ > --- > > (Updated Aug. 5, 2019, 2:17 a.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9900 > https://issues.apache.org/jira/browse/MESOS-9900 > > > Repository: mesos > > > Description > --- > > The `disk/xfs` isolator assumed that there would only be a single > directory path for each project quota. When we apply project quotas > to the overlayfs upperdir, that won't be true any more, since the > upperdir will come under the same quota as the task sandbox. > > Update the quota reclamation tracking to keep a set of disk paths that > the quota has been applied to, and only reclaim the project ID once all > those paths have been removed. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/xfs/disk.hpp > 94d44e7f39dc01037461015b499a1fc3169b24e8 > src/slave/containerizer/mesos/isolators/xfs/disk.cpp > 646330c65b24aa28801ec99d7909db08a3e05c79 > > > Diff: https://reviews.apache.org/r/71193/diff/4/ > > > Testing > --- > > sudo make check (Frdora 30) > > > Thanks, > > James Peach > >
Re: Review Request 71194: Add `disk/xfs` isolator support for ephemeral volumes.
> On Aug. 2, 2019, 4:03 p.m., Jiang Yan Xu wrote: > > src/slave/containerizer/mesos/isolators/xfs/disk.cpp > > Lines 424 (patched) > > <https://reviews.apache.org/r/71194/diff/3/?file=2158813#file2158813line424> > > > > FWIW sandbox scanning predates the container run state, now it does > > look like at least sandboxes (or any paths that can be deduced from the > > container ID) would be covered given it's checkpointed pretty early: > > https://github.com/apache/mesos/blob/e1176c453d04a8ef8f53cf23928b5bbb09173d78/src/slave/containerizer/mesos/containerizer.cpp#L1547 > > > > but of course the emphemeral volumes are added later than that and > > there could be dirs with projectIDs set before `ephemeral_volumes` is > > persisted. Sorry I meant to add that the provisioner dir for a container can also be [deduced](https://github.com/apache/mesos/blob/e1176c453d04a8ef8f53cf23928b5bbb09173d78/src/slave/containerizer/mesos/provisioner/paths.hpp#L38) from the containerID so you can just scan the subdirs of the containers recovered. Up to you. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71194/#review217058 --- On Aug. 1, 2019, 7:27 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71194/ > --- > > (Updated Aug. 1, 2019, 7:27 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9900 > https://issues.apache.org/jira/browse/MESOS-9900 > > > Repository: mesos > > > Description > --- > > Add support for labeling ephemeral volumes with the sandbox XFS > project ID. This makes changes to the container rootfs share the > same disk quota as the sandbox. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/xfs/disk.cpp > 646330c65b24aa28801ec99d7909db08a3e05c79 > src/slave/containerizer/mesos/provisioner/backends/overlay.hpp > 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 > src/slave/containerizer/mesos/provisioner/backends/overlay.cpp > f2040cf36c601a13281a78ff844ebd41000a2d65 > > > Diff: https://reviews.apache.org/r/71194/diff/3/ > > > Testing > --- > > sudo make check (Frdora 30) > > > Thanks, > > James Peach > >
Re: Review Request 71194: Add `disk/xfs` isolator support for ephemeral volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71194/#review217058 --- src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 342 (patched) <https://reviews.apache.org/r/71194/#comment304302> Use `at()`? src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 346 (patched) <https://reviews.apache.org/r/71194/#comment304307> There are three experessions `foreachpair` so if we break them apart, put each one a line? src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 347 (patched) <https://reviews.apache.org/r/71194/#comment304308> s/container volumes/persistent volumes/? src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 423 (patched) <https://reviews.apache.org/r/71194/#comment304310> Do we need to single out the overlay backend? Seems like we can just scan all backends? src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 424 (patched) <https://reviews.apache.org/r/71194/#comment304309> FWIW sandbox scanning predates the container run state, now it does look like at least sandboxes (or any paths that can be deduced from the container ID) would be covered given it's checkpointed pretty early: https://github.com/apache/mesos/blob/e1176c453d04a8ef8f53cf23928b5bbb09173d78/src/slave/containerizer/mesos/containerizer.cpp#L1547 but of course the emphemeral volumes are added later than that and there could be dirs with projectIDs set before `ephemeral_volumes` is persisted. src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 517 (patched) <https://reviews.apache.org/r/71194/#comment304311> Same question: why don't we fail here? - Jiang Yan Xu On Aug. 1, 2019, 7:27 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71194/ > --- > > (Updated Aug. 1, 2019, 7:27 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9900 > https://issues.apache.org/jira/browse/MESOS-9900 > > > Repository: mesos > > > Description > --- > > Add support for labeling ephemeral volumes with the sandbox XFS > project ID. This makes changes to the container rootfs share the > same disk quota as the sandbox. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/xfs/disk.cpp > 646330c65b24aa28801ec99d7909db08a3e05c79 > src/slave/containerizer/mesos/provisioner/backends/overlay.hpp > 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 > src/slave/containerizer/mesos/provisioner/backends/overlay.cpp > f2040cf36c601a13281a78ff844ebd41000a2d65 > > > Diff: https://reviews.apache.org/r/71194/diff/3/ > > > Testing > --- > > sudo make check (Frdora 30) > > > Thanks, > > James Peach > >
Re: Review Request 71192: Propagate ephemeral volume information from rootfs.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71192/#review217060 --- Ship it! Ship It! - Jiang Yan Xu On July 31, 2019, 6:03 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71192/ > --- > > (Updated July 31, 2019, 6:03 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9900 > https://issues.apache.org/jira/browse/MESOS-9900 > > > Repository: mesos > > > Description > --- > > Propagate the overlayfs upperdir path to the containerization > layer in a general form as a set of ephemeral volume paths. > > > Diffs > - > > include/mesos/slave/containerizer.proto > a60c96302a6cec90ecd0a0885b844fff8d37db71 > src/slave/containerizer/mesos/containerizer.cpp > a01edc8793a2eaa655f1729a01a01f1f61fbf7cb > src/slave/containerizer/mesos/provisioner/backend.hpp > 7257d3a962ecdf87fe9d52facbd6a2619311a018 > src/slave/containerizer/mesos/provisioner/backends/aufs.hpp > 2c2518775d2bcb3b1857775723828c4ac08111ca > src/slave/containerizer/mesos/provisioner/backends/aufs.cpp > 2eba55228e7596c170dddea2a9930ba6cf73404f > src/slave/containerizer/mesos/provisioner/backends/bind.hpp > c4a1d5f35710f6eab13938333ed4cb97d9ff2f6b > src/slave/containerizer/mesos/provisioner/backends/bind.cpp > 4cc4c520aa6872c05f474bc7837a353b43983e0b > src/slave/containerizer/mesos/provisioner/backends/copy.hpp > 5dc9a2fd38cdfafda02129b0d144eee3230a1ff2 > src/slave/containerizer/mesos/provisioner/backends/copy.cpp > 10516caab82d910944814b80e0a1d9aeba19007c > src/slave/containerizer/mesos/provisioner/backends/overlay.hpp > 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 > src/slave/containerizer/mesos/provisioner/backends/overlay.cpp > f2040cf36c601a13281a78ff844ebd41000a2d65 > src/slave/containerizer/mesos/provisioner/provisioner.hpp > 7f84aa499b21140eb29ef7f81e2608743b12eb75 > src/slave/containerizer/mesos/provisioner/provisioner.cpp > a081fb059f9e0788c2153d1e04eda37fe514a1f7 > > > Diff: https://reviews.apache.org/r/71192/diff/3/ > > > Testing > --- > > sudo make check (Frdora 30) > > > Thanks, > > James Peach > >
Re: Review Request 71193: Supported multiple quota paths in the `disk/xfs` isolator.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71193/#review217059 --- Fix it, then Ship it! src/slave/containerizer/mesos/isolators/xfs/disk.cpp Line 570 (original), 566 (patched) <https://reviews.apache.org/r/71193/#comment304303> You are not changing the logic here but could you remind me why this error doesn't fail the update? src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 776 (patched) <https://reviews.apache.org/r/71193/#comment304304> For `hashmap` there's already a `contains` defined. src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 778-779 (patched) <https://reviews.apache.org/r/71193/#comment304305> You could've used a `else if` followed by an `else` to avoid another level of nesting. src/slave/containerizer/mesos/isolators/xfs/disk.cpp Line 807 (original) <https://reviews.apache.org/r/71193/#comment304306> We could move the logging about the `dir` up above `erase` if we still want to log it (maybe for `VLOG(1)`)? - Jiang Yan Xu On July 30, 2019, 12:52 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71193/ > --- > > (Updated July 30, 2019, 12:52 a.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9900 > https://issues.apache.org/jira/browse/MESOS-9900 > > > Repository: mesos > > > Description > --- > > The `disk/xfs` isolator assumed that there would only be a single > directory path for each project quota. When we apply project quotas > to the overlayfs upperdir, that won't be true any more, since the > upperdir will come under the same quota as the task sandbox. > > Update the quota reclamation tracking to keep a set of disk paths that > the quota has been applied to, and only reclaim the project ID once all > those paths have been removed. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/xfs/disk.hpp > 94d44e7f39dc01037461015b499a1fc3169b24e8 > src/slave/containerizer/mesos/isolators/xfs/disk.cpp > 646330c65b24aa28801ec99d7909db08a3e05c79 > > > Diff: https://reviews.apache.org/r/71193/diff/3/ > > > Testing > --- > > sudo make check (Frdora 30) > > > Thanks, > > James Peach > >
Re: Review Request 71192: Propagate ephemeral volume information from rootfs.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71192/#review217057 --- src/slave/containerizer/mesos/provisioner/backend.hpp Lines 55 (patched) <https://reviews.apache.org/r/71192/#comment304301> The backend abstraction probably shouldn't care about quota so probably phrasing it as "additional paths the backend has created" or something? src/slave/containerizer/mesos/provisioner/backends/overlay.cpp Lines 243 (patched) <https://reviews.apache.org/r/71192/#comment304300> Nit: with initializers I don't think we wrap elements with spaces. - Jiang Yan Xu On July 31, 2019, 6:03 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71192/ > --- > > (Updated July 31, 2019, 6:03 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9900 > https://issues.apache.org/jira/browse/MESOS-9900 > > > Repository: mesos > > > Description > --- > > Propagate the overlayfs upperdir path to the containerization > layer in a general form as a set of ephemeral volume paths. > > > Diffs > - > > include/mesos/slave/containerizer.proto > a60c96302a6cec90ecd0a0885b844fff8d37db71 > src/slave/containerizer/mesos/containerizer.cpp > a01edc8793a2eaa655f1729a01a01f1f61fbf7cb > src/slave/containerizer/mesos/provisioner/backend.hpp > 7257d3a962ecdf87fe9d52facbd6a2619311a018 > src/slave/containerizer/mesos/provisioner/backends/aufs.hpp > 2c2518775d2bcb3b1857775723828c4ac08111ca > src/slave/containerizer/mesos/provisioner/backends/aufs.cpp > 2eba55228e7596c170dddea2a9930ba6cf73404f > src/slave/containerizer/mesos/provisioner/backends/bind.hpp > c4a1d5f35710f6eab13938333ed4cb97d9ff2f6b > src/slave/containerizer/mesos/provisioner/backends/bind.cpp > 4cc4c520aa6872c05f474bc7837a353b43983e0b > src/slave/containerizer/mesos/provisioner/backends/copy.hpp > 5dc9a2fd38cdfafda02129b0d144eee3230a1ff2 > src/slave/containerizer/mesos/provisioner/backends/copy.cpp > 10516caab82d910944814b80e0a1d9aeba19007c > src/slave/containerizer/mesos/provisioner/backends/overlay.hpp > 362e02172d2fd8e6e241fb6f5689f569ba74a0d1 > src/slave/containerizer/mesos/provisioner/backends/overlay.cpp > f2040cf36c601a13281a78ff844ebd41000a2d65 > src/slave/containerizer/mesos/provisioner/provisioner.hpp > 7f84aa499b21140eb29ef7f81e2608743b12eb75 > src/slave/containerizer/mesos/provisioner/provisioner.cpp > a081fb059f9e0788c2153d1e04eda37fe514a1f7 > > > Diff: https://reviews.apache.org/r/71192/diff/3/ > > > Testing > --- > > sudo make check (Frdora 30) > > > Thanks, > > James Peach > >
Re: Review Request 71080: Master should store the list of completed framework ids for lifecycle.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/71080/#review216836 --- LGTM as a work around, which we won't commit upstream. We reached a decision in https://mesos.slack.com/archives/C8PDVCDE3/p1563816071006900 to not merge it upstream as we couldn't reach consensus on how the flag / semantics should be maintained. include/mesos/allocator/allocator.hpp Lines 66 (patched) <https://reviews.apache.org/r/71080/#comment304084> `maxCompletedFrameworks` is used for the metrics to keep for completed frameworks in the allocator. `maxCompletedFramework_ids` (we should name it `maxCompletedFrameworkIDs` if it were useful) is not used and we don't need to add this field. src/master/master.hpp Lines 2210 (patched) <https://reviews.apache.org/r/71080/#comment304085> camelCasing to be consistent. src/master/master.hpp Lines 2215 (patched) <https://reviews.apache.org/r/71080/#comment304086> I would use a `bool` (1 byte) as the value. And a comment that the value is not meaningful and we are just using the map to simulate a "bounded hashset". - Jiang Yan Xu On July 18, 2019, 8:10 p.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/71080/ > --- > > (Updated July 18, 2019, 8:10 p.m.) > > > Review request for mesos, James Peach and Jiang Yan Xu. > > > Bugs: MESOS-8636 > https://issues.apache.org/jira/browse/MESOS-8636 > > > Repository: mesos > > > Description > --- > > It should be separately from that for webUI and endpoints. > Currently the master stores the history of completed frameworks in > a map with the full historical data of the framework, it is > prohibitively expensive to keep a long history; In order to reject > frameworks from reregistering if they have previously marked as > completed, we only need to persist the framework IDs and are able > to keep long history. > > > Diffs > - > > docs/configuration/master.md c56ac8510ea968f9587e23e81ed310caa968ee9e > docs/operator-http-api.md 2d4a9b66e20cf19eceec718b7de3d812ab285772 > include/mesos/allocator/allocator.hpp > 2bab53ab5fb25931a724c20a039e1301983ba574 > src/master/constants.hpp 26afa356b7844b4ec6c2caeef33bd39c51148d5f > src/master/flags.hpp 78623d68bf428cd3f52684303d98a525d42eb878 > src/master/flags.cpp 74f4daadd48e8e691be43759b88dc8b3c2df489a > src/master/http.cpp 765d5052ab6a8d731113f8922d20fb280b843003 > src/master/master.hpp ffa7423ba533725f7c1123d9aa507b1348e7f281 > src/master/master.cpp f1ca637b4cb0382caec53b5a81f6a4eb46f4dd2d > > > Diff: https://reviews.apache.org/r/71080/diff/7/ > > > Testing > --- > > make check > > sudo GLOG_v=1 ./bin/mesos-tests.sh --verbose > --gtest_filter=MasterTest.MaxCompletedFrameworksFlag --gtest_break_on_failure > --gtest_repeat=1000 > [ OK ] MasterTest.MaxCompletedFrameworksFlag (230 ms) > [--] 1 test from MasterTest (235 ms total) > > [--] Global test environment tear-down > [==] 1 test from 1 test case ran. (281 ms total) > [ PASSED ] 1 test. > > > Thanks, > > Xudong Ni > >
Re: Review Request 70862: Update `EXPECT` to `ASSERT` in blkio tests.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70862/#review215928 --- Ship it! Ship It! - Jiang Yan Xu On June 16, 2019, 7:13 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70862/ > --- > > (Updated June 16, 2019, 7:13 p.m.) > > > Review request for mesos, Gilbert Song and Qian Zhang. > > > Repository: mesos > > > Description > --- > > Since the blkio cgroup tests immediately require the result of > `blkio_statistics()`, we should test for the presence of statistics > with `ASSERT_TRUE` rather than `EXPECT_TRUE`. > > > Diffs > - > > src/tests/containerizer/cgroups_isolator_tests.cpp > eef8eb28ea57ef0a2a3c626ac9aee202eb7231d9 > > > Diff: https://reviews.apache.org/r/70862/diff/1/ > > > Testing > --- > > make check (Fedora 30) > > > Thanks, > > James Peach > >
Re: Review Request 70741: Adopted container file operations for secrets volumes.
> On June 5, 2019, 12:16 a.m., Jiang Yan Xu wrote: > > src/slave/containerizer/mesos/launch.cpp > > Lines 516 (patched) > > <https://reviews.apache.org/r/70741/diff/2/?file=2147412#file2147412line524> > > > > Is it more explicit if you just name the operation > > `ContainerFileOperation::MV` or `ContainerFileOperation::MOVE` and > > implement it as such (rename + fallback) either here or in stout since you > > are essentially implementing how we use a `mv` command? > > James Peach wrote: > `rename` vs. `move` seems pretty similar. To me, `rename` seemed clear so > I prefer it. I didn't add a stout implementation because I didn't have the > time to handle all the corner cases and write proper tests for a general > utility. I think people tend to map them to linux syscall or commands so renaming doesn't equal potentially copying and deleting? Just my opinion. Still giving you the ship it. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70741/#review215696 --- On June 5, 2019, 1:44 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70741/ > --- > > (Updated June 5, 2019, 1:44 a.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and > Jiang Yan Xu. > > > Bugs: MESOS-9799 > https://issues.apache.org/jira/browse/MESOS-9799 > > > Repository: mesos > > > Description > --- > > Switched the `volumes/secrets` isolator from using container > pre-exec commands to using file operations. This allows us to > reduce the number of forked child processes, consolidate code > and leaves the `network/port_mapping` isolator as the last > consumer of pre-exec commands. > > > Diffs > - > > include/mesos/slave/containerizer.proto > b2e35cbf01caea6c1e4f45c7b7d833bc7f065099 > src/common/protobuf_utils.hpp 8aa9aff71cbbd8e98573d0095cfc007cdea163bc > src/common/protobuf_utils.cpp 870859de93b44a1debc6786a562b4bc28955ddab > src/slave/containerizer/mesos/isolators/volume/secret.cpp > 7a9bb82b23b35728408fb37bac53d79883c0a19f > src/slave/containerizer/mesos/launch.cpp > 0c482f46a97063133edfe29ae3c6a2721d29f6c6 > > > Diff: https://reviews.apache.org/r/70741/diff/4/ > > > Testing > --- > > sudo make check (Fedora 30) > > > Thanks, > > James Peach > >
Re: Review Request 70741: Adopted container file operations for secrets volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70741/#review215714 --- Ship it! Ship It! - Jiang Yan Xu On June 5, 2019, 1:44 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70741/ > --- > > (Updated June 5, 2019, 1:44 a.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and > Jiang Yan Xu. > > > Bugs: MESOS-9799 > https://issues.apache.org/jira/browse/MESOS-9799 > > > Repository: mesos > > > Description > --- > > Switched the `volumes/secrets` isolator from using container > pre-exec commands to using file operations. This allows us to > reduce the number of forked child processes, consolidate code > and leaves the `network/port_mapping` isolator as the last > consumer of pre-exec commands. > > > Diffs > - > > include/mesos/slave/containerizer.proto > b2e35cbf01caea6c1e4f45c7b7d833bc7f065099 > src/common/protobuf_utils.hpp 8aa9aff71cbbd8e98573d0095cfc007cdea163bc > src/common/protobuf_utils.cpp 870859de93b44a1debc6786a562b4bc28955ddab > src/slave/containerizer/mesos/isolators/volume/secret.cpp > 7a9bb82b23b35728408fb37bac53d79883c0a19f > src/slave/containerizer/mesos/launch.cpp > 0c482f46a97063133edfe29ae3c6a2721d29f6c6 > > > Diff: https://reviews.apache.org/r/70741/diff/4/ > > > Testing > --- > > sudo make check (Fedora 30) > > > Thanks, > > James Peach > >
Re: Review Request 70741: Adopted container file operations for secrets volumes.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70741/#review215696 --- src/common/protobuf_utils.hpp Lines 100 (patched) <https://reviews.apache.org/r/70741/#comment302508> Why this? The paragraphs above are still about `validateProtobufUnion` right? src/common/protobuf_utils.hpp Lines 436 (patched) <https://reviews.apache.org/r/70741/#comment302509> One empty line between the last element and the ending `}` src/slave/containerizer/mesos/isolators/volume/secret.cpp Line 138 (original), 147 (patched) <https://reviews.apache.org/r/70741/#comment302511> Do you know why `launchInfo.mounts` was not used (and for the same reason it couldn't be used here)? src/slave/containerizer/mesos/launch.cpp Lines 516 (patched) <https://reviews.apache.org/r/70741/#comment302510> Is it more explicit if you just name the operation `ContainerFileOperation::MV` or `ContainerFileOperation::MOVE` and implement it as such (rename + fallback) either here or in stout since you are essentially implementing how we use a `mv` command? - Jiang Yan Xu On May 28, 2019, 9:29 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70741/ > --- > > (Updated May 28, 2019, 9:29 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and > Jiang Yan Xu. > > > Bugs: MESOS-9799 > https://issues.apache.org/jira/browse/MESOS-9799 > > > Repository: mesos > > > Description > --- > > Switched the `volumes/secrets` isolator from using container > pre-exec commands to using file operations. This allows us to > reduce the number of forked child processes, consolidate code > and leaves the `network/port_mapping` isolator as the last > consumer of pre-exec commands. > > > Diffs > - > > include/mesos/slave/containerizer.proto > b2e35cbf01caea6c1e4f45c7b7d833bc7f065099 > src/common/protobuf_utils.hpp 8aa9aff71cbbd8e98573d0095cfc007cdea163bc > src/common/protobuf_utils.cpp 870859de93b44a1debc6786a562b4bc28955ddab > src/slave/containerizer/mesos/isolators/volume/secret.cpp > 7a9bb82b23b35728408fb37bac53d79883c0a19f > src/slave/containerizer/mesos/launch.cpp > 0c482f46a97063133edfe29ae3c6a2721d29f6c6 > > > Diff: https://reviews.apache.org/r/70741/diff/2/ > > > Testing > --- > > sudo make check (Fedora 30) > > > Thanks, > > James Peach > >
Re: Review Request 70712: Added filesystem operations to the `ContainerLaunchInfo`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70712/#review215515 --- include/mesos/slave/containerizer.proto Lines 207 (patched) <https://reviews.apache.org/r/70712/#comment302249> Nit: not all possible file operation have a source and a target? Would something like ``` message ContainerFileOperation { ... message Symlink { optional string target = 1; } optional string path = 2; ``` make more sense? - Jiang Yan Xu On May 23, 2019, 11:46 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70712/ > --- > > (Updated May 23, 2019, 11:46 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, Jacob Janco, and > Jiang Yan Xu. > > > Bugs: MESOS-9769 > https://issues.apache.org/jira/browse/MESOS-9769 > > > Repository: mesos > > > Description > --- > > The `filesystem/linux` isolator was using pre-exec commands > to set up Linux ABI symlinks. Not only is this inefficient, > it has the undesirable security property of running programs > in a user-controlled container image. > > The fix added a new `ContainerFileOperation` message to the > containerizer launch information. The containerizer executes > the requested file operation after performing the container > mounts. > > > Diffs > - > > include/mesos/slave/containerizer.proto > e9924489000efabebd55bf070f18149f23e4a510 > src/common/protobuf_utils.hpp 273ae270695db33b6c9d8b32cb38f8840a815787 > src/common/protobuf_utils.cpp 8b252cb11e17356836988dfc44a63953579a1def > src/slave/containerizer/mesos/isolators/cgroups/cgroups.cpp > 8f94453a7354927ae918d3f2fd746cdf5ef63cb7 > src/slave/containerizer/mesos/isolators/filesystem/linux.cpp > 190054c26b949aa9ba0f49377b77d9e472edb95a > src/slave/containerizer/mesos/launch.cpp > 5ddb4c7d998c17b59164825acc0627a1311b691b > > > Diff: https://reviews.apache.org/r/70712/diff/1/ > > > Testing > --- > > sudo make check (Fedora 30) > > > Thanks, > > James Peach > >
Re: Review Request 70660: Fix the XFS build for recent Fedora versions.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/70660/#review215327 --- Ship it! Ship It! - Jiang Yan Xu On May 16, 2019, 7:24 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/70660/ > --- > > (Updated May 16, 2019, 7:24 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > On Fedora 30, we need to include to get the right > types, since is no longer self-contained. For > earlier versions, it is still safe to include both headers if > they are available, though h> may need to be > obtained via the xfsprogs-qa-devel package. > > > Diffs > - > > configure.ac b4bad5716986e2f7c132c6515179a65ccbfdaeac > src/slave/containerizer/mesos/isolators/xfs/utils.hpp > db1829abbaac6113d39e71673403afe75b5ee738 > > > Diff: https://reviews.apache.org/r/70660/diff/1/ > > > Testing > --- > > make check (Fedota 30) > make (CentOS 6) > > > Thanks, > > James Peach > >
Re: Review Request 69615: Disable containerizer ptrace attach.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69615/#review212822 --- src/slave/containerizer/mesos/launch.cpp Lines 556 (patched) <https://reviews.apache.org/r/69615/#comment298707> The main thing I wanted to confirm is what we chatted about offline but I didn't write down: the interpretation of this paragraph in http://man7.org/linux/man-pages/man2/prctl.2.html ``` PR_SET_DUMPABLE ... Normally, this flag is set to 1. However, it is reset to the current value contained in the file /proc/sys/fs/suid_dumpable (which by default has the value 0), in the following circumstances: * The process's effective user or group ID is changed. ... ``` This reads to me like, in a typical setup which runs Mesos agent as uid 0 and the task with another uid (hence the call `os::setuid(uid.get());` below does change uid): - If `/proc/sys/fs/suid_dumpable` is the default 0, then regardless of `--allow_ptrace` the result is that the launcher process becomes undumpable. - If `/proc/sys/fs/suid_dumpable` is 1, then regardless of `--allow_ptrace` the result is that the launcher process becomes dumpable. i.e., the `prctl()` call here is a noop. Maybe I misunderstood the doc but perhaps we can tweak the test to be a `ROOT_*` test and change to a different task user in order to verify the behavior? If you have tested it manually that's fine too just wanted to double check :) FWIW, I checked a sample container that we run (without this patch) and see that ``` # ls -l /proc/1/ ``` shows the files under it are all owned by root, which does appear to mean that the process' dumpable flag is not 1 according to http://man7.org/linux/man-pages/man5/proc.5.html? - Jiang Yan Xu On Feb. 8, 2019, 1:09 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69615/ > --- > > (Updated Feb. 8, 2019, 1:09 p.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9349 > https://issues.apache.org/jira/browse/MESOS-9349 > > > Repository: mesos > > > Description > --- > > Use `prctl(PR_SET_DUMPABLE)` to disable the ability to attach to > the containerizer process(es) on Linux systems. This prevents > unprivileged containerized processes from reading information > about the containerizer process(es) from `/proc`. This gives an > additional layer of protection against leaking information to > untrusted container processes. > > > Diffs > - > > docs/configuration/agent.md a4015c409d00a4c117df6c869d5ba5072bcfe58e > src/launcher/default_executor.cpp 5837cfa4deba557cae43112092ff24b97137951f > src/launcher/executor.cpp f962e800f23d5582b1bc04a263253893492a5054 > src/slave/containerizer/mesos/containerizer.cpp > 35f51ad33da53b3e6a8eec275fbf3e77782b0fba > src/slave/containerizer/mesos/launch.hpp > 0a6394d56321948ad760ac69c05456319a254842 > src/slave/containerizer/mesos/launch.cpp > 7f401cdf481123b8c6cc500ac02bb7daf2613d2c > src/slave/flags.hpp 7346ba5b711a8353a4bc1d7dcd2f6184b777ddd0 > src/slave/flags.cpp 066b84f528b4c888dde399e0b5d5fe5531934de6 > src/slave/slave.cpp 0182dd2ca326723e96eef8c072696ad3c873de0b > src/tests/containerizer/mesos_containerizer_tests.cpp > 449928c10b897061642af8ad267f8b70695940e6 > src/tests/slave_tests.cpp 22a0295086ae4f4ec26df00a0e077eecfa27f1fb > > > Diff: https://reviews.apache.org/r/69615/diff/3/ > > > Testing > --- > > make check (Fedora 29) > > > Thanks, > > James Peach > >
Re: Review Request 69615: Disable containerizer ptrace attach.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69615/#review211839 --- For consistency if we updated command excutor we should update the default executor? src/slave/containerizer/mesos/launch.cpp Lines 140 (patched) <https://reviews.apache.org/r/69615/#comment297764> s/to to/to/ src/slave/containerizer/mesos/launch.cpp Lines 505 (patched) <https://reviews.apache.org/r/69615/#comment297763> I haven't tried but I trust that you've verified that `PR_SET_DUMPABLE` is not persisted across fork but could you clarfiy via a comment? It's not clear from http://man7.org/linux/man-pages/man2/prctl.2.html. src/slave/flags.cpp Lines 338 (patched) <https://reviews.apache.org/r/69615/#comment297762> If disallow is the default, it's a breaking change right? Can the default be allowing it? - Jiang Yan Xu On Jan. 2, 2019, 9:15 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69615/ > --- > > (Updated Jan. 2, 2019, 9:15 a.m.) > > > Review request for mesos, Xudong Ni, Gilbert Song, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9349 > https://issues.apache.org/jira/browse/MESOS-9349 > > > Repository: mesos > > > Description > --- > > Use `prctl(PR_SET_DUMPABLE)` to disable the ability to attach to > the containerizer process(es) on Linux systems. This prevents > unprivileged containerized processes from reading information > about the containerizer process(es) from `/proc`. This gives an > additional layer of protection against leaking information to > untrusted container processes. > > > Diffs > - > > docs/configuration/agent.md 330283f4e3957075dd4310de4a841feac23de36c > src/launcher/executor.cpp f962e800f23d5582b1bc04a263253893492a5054 > src/slave/containerizer/mesos/containerizer.cpp > a5cf2da55c046c5c45e0c2ca3400f64de12de62b > src/slave/containerizer/mesos/launch.hpp > 0a6394d56321948ad760ac69c05456319a254842 > src/slave/containerizer/mesos/launch.cpp > 2f1c9e7a8748c9d7eab25bc8567ca68308e680f9 > src/slave/flags.hpp 494ae02ab5eb365e2cda5017be573691107c3f28 > src/slave/flags.cpp 6bac8e1409f04d639204c45eda8a90c098e3dbd0 > src/slave/slave.cpp ad3b693a716cf6103345a157bf28dd60a7b07d32 > src/tests/containerizer/mesos_containerizer_tests.cpp > 449928c10b897061642af8ad267f8b70695940e6 > src/tests/slave_tests.cpp 4aed5d68e9a408821880ffaede482937be1999f4 > > > Diff: https://reviews.apache.org/r/69615/diff/2/ > > > Testing > --- > > make check (Fedora 29) > > > Thanks, > > James Peach > >
Re: Review Request 69373: Replaced a log consensus `CHECK()` with CHECK_GE()`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69373/#review210613 --- Ship it! Ship It! - Jiang Yan Xu On Nov. 16, 2018, 11:25 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69373/ > --- > > (Updated Nov. 16, 2018, 11:25 a.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > In scale testing, this `CHECK` failed with the following message: > > F1116 17:50:04.868387 53766 consensus.cpp:771] Check failed: > highestNackProposal >= proposal > > Emitting the values for `highestNackProposal` and `proposal` may > help in debugging this failure. > > > Diffs > - > > src/log/consensus.cpp bce7facb0fe0d89ee96951bcf144332e41eb683e > > > Diff: https://reviews.apache.org/r/69373/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > James Peach > >
Re: Review Request 68706: Added master failover reregistration progress metrics.
> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote: > > src/master/master.cpp > > Lines 1874-1875 (patched) > > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874> > > > > If we use equality operator and only set the timer when such a number > > of reregistered agent is reached we are guaranteed to only set each timer > > once (but we may need to set multiple timers in one call) right? This > > alleviates the need to check if the timer is already set! > > > > This should also work with boundary cases like the total recovered > > agents count being 0 or 1 (overlapping percentiles) etc. Right? > > Jiang Yan Xu wrote: > We should comment on the reasons for dropping issues. > > Xudong Ni wrote: > This is actaully fixed and marked as dropped somehow. We used equality > operator to 0.0 to check whether the percentile was reached or not; The > reason we used push gauge not timer is explained in push gauge vs timer > comments section The point here is "This alleviates the need to check if the timer is already set!". It's still in the new revision. I made another comment about it. > On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote: > > src/master/metrics.hpp > > Lines 51 (patched) > > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51> > > > > On using Timer (e.g., like > > [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) > > vs. PushGauge, after looking at how it'll be used I think the main > > advantage of Timer is that it doesn't export any value if you haven't set > > it. > > > > Consider the two cases: > > > > 1. There are 1000 recovered agents and 0 have reregsitered, should all > > the timers have zero values or should they be absent? > > > > 2. There are 0 recovered agents (e.g., brand new cluster), should all > > of the metrics be zero or non-existent? I feel like they should be zero, as > > in, e.g., 100% of all 0 agents are reregistered within 0 secs. > > > > So timer handles this natrually. Also it sets the `_secs` name for you > > but that's a minor conveninence. > > Jiang Yan Xu wrote: > We should comment on the reasons for dropping issues. > > Xudong Ni wrote: > Sorry about this, I did make the comments but it must be in one of draft > which was not saved. > > I agree that metrics should be zero but not absent when certain > percentige were not reached. I did tried both PushGauge and Timer > implementation and tested in our clusters. > > If we used the timer, when the value was not set, that metric is actually > missing. PushGauge is set with the initial value 0.0 and we can tell whether > it's set yet, the metric will always exist no matter that percentile reached > or not, and it has better performance. > > The brand new cluster case was tested and the results were included in > the test results. My point was the opposite. When a percentage is not reached, I feel that the value being absent is preferrable and I used the above two cases to demonstrate it. (There is a subtle difference between the two IMO). It's in fact the current practice used by metrics like `state_fetch`. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review209722 --- On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote: > > ----------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > --- > > (Updated Oct. 19, 2018, 4:56 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > --- > > During the master failover, the time that the master elected is > considered as the start of failover. In the progress of > reregistration, the percentile represents the time when such > percentile of agents finished registration again; The percentile of > these data as in this metrics can represent overall reregistration > progress; In case of degradation towards to the end of reregistration, > the high percentile can reflect it; In the case there are unreachable > agents in the failover, if certain percentile recovery
Re: Review Request 68706: Added master failover reregistration progress metrics.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review210145 --- There were changes added for a test but it's gone in the latest revision? src/master/master.cpp Lines 1720 (patched) <https://reviews.apache.org/r/68706/#comment294924> You can construct the entire `Option agentReregistrations` struct here using the `expectedAgentCount` as well as the `electedTime` which is a class member (Rationale explained below). Initializing `agentReregistrations` is just one step in the list of steps taken in `Master::_recover` so you don't have to squeeze it between the two lines for allocator recover code. It's not appropriate as the block comment above clearly tries to explain the allocator reocvery. You can just put it somewhere below by itself (blank lines above and below the statement). The `expectedAgentCount` variable is already set and you can use; an frankly you can just use `registry.slaves().slaves().size()` as well. src/master/master.cpp Lines 2099 (patched) <https://reviews.apache.org/r/68706/#comment294925> - End sentences with `.` - s/master/the master/ src/master/metrics.hpp Lines 87 (patched) <https://reviews.apache.org/r/68706/#comment294779> - End sentences with `.`. - s/when/during/ src/master/metrics.hpp Lines 90 (patched) <https://reviews.apache.org/r/68706/#comment294778> No need for `explicit` as it's only for single argument constructors, you don't need it either with zero or two arguments (one proposal below). src/master/metrics.hpp Lines 90-123 (patched) <https://reviews.apache.org/r/68706/#comment294927> To be consistent let's put the implementation in the cpp file master/metrics.cpp. Here and for other methods as well. src/master/metrics.hpp Lines 92 (patched) <https://reviews.apache.org/r/68706/#comment294926> Indent two more spaces, we can see examples in master/metrics.cpp, here and elsewhere. src/master/metrics.hpp Lines 130 (patched) <https://reviews.apache.org/r/68706/#comment294928> We generally don't use one letter variable names. `t` here seemingly refers to `Time` but it's a `Duration`. It won't hurt to keep it strongly types until we need to convert it, i.e., Use `Duration duration = Clock::now() - electedTime;` should be fine. src/master/metrics.hpp Lines 132 (patched) <https://reviews.apache.org/r/68706/#comment294929> A space between `if` and `(`. Here and elsewhere. src/master/metrics.hpp Lines 133 (patched) <https://reviews.apache.org/r/68706/#comment294931> Let's `#include ` for ceil, in general include all needed headers without relying on transitive dependencies is a good pratice. src/master/metrics.hpp Lines 138 (patched) <https://reviews.apache.org/r/68706/#comment294932> As I commented in the previous revision, we don't need to check if a timer is already set. A percentage timer corresponds to one precise `reregisteredAgentCount` (and not vice versa) so we just need to check that. Does it make sense? src/master/metrics.hpp Lines 176-177 (patched) <https://reviews.apache.org/r/68706/#comment294923> Semantically it's hard to explain why `recoveredAgentCount` is an `Option` while `reregisteredAgentCount` is not? Why not have all three fields as Options? But then one may ask why not have the entire struct by an Option? It's true that we don't have a `recoveredAgentCount` before they are recovered and `electedTime` before the master is elected but we can just use `Option agentReregistrations` like I suggested? The semantics of `Option agentReregistrations` seems natural to me: it is optional because agent reregistrations only happen after the master is recovered. - Jiang Yan Xu On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > --- > > (Updated Oct. 19, 2018, 4:56 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > --- > > During the master failover, the time that the master elected is > considered as the start of failover. In the progress of > reregistration, the percentile represents the time when such > percentile of agents finished registration again; The percentile of > these data as in this metrics can represent overall reregistration > progress; In case of degradation towards to the end of reregistration,
Re: Review Request 69233: Fixed a GetContainers crash when using XFS disk isolation.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/69233/#review210261 --- Ship it! src/slave/containerizer/mesos/isolators/xfs/disk.cpp Lines 660 (patched) <https://reviews.apache.org/r/69233/#comment294910> Might be worth adding a comment about root disks not having `Source` since people often forget. :) - Jiang Yan Xu On Nov. 1, 2018, 1:33 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/69233/ > --- > > (Updated Nov. 1, 2018, 1:33 p.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: MESOS-9367 > https://issues.apache.org/jira/browse/MESOS-9367 > > > Repository: mesos > > > Description > --- > > Since the XFS disk isolator didn't check that the persistent volume > has a `DiskInfo::Source` before populating the corresponding field > in the `ResourceStatistics` message. This caused the JSON conversion > (triggered by operator API calls) to check fail. The fix is to simply > check that the volume has a `DiskInfo::Source` before propagating it. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/xfs/disk.cpp > 44b7d0570b00d50e1fc454100ab73f4dfe5ca159 > src/tests/containerizer/xfs_quota_tests.cpp > b2f977f15e2ea8cbe5f1d60e046a8f7d3560 > > > Diff: https://reviews.apache.org/r/69233/diff/1/ > > > Testing > --- > > sudo make check (Fedora 28) > > > Thanks, > > James Peach > >
Re: Review Request 68706: Added master failover reregistration progress metrics.
> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote: > > src/master/master.cpp > > Lines 1874-1875 (patched) > > <https://reviews.apache.org/r/68706/diff/5/?file=2098490#file2098490line1874> > > > > If we use equality operator and only set the timer when such a number > > of reregistered agent is reached we are guaranteed to only set each timer > > once (but we may need to set multiple timers in one call) right? This > > alleviates the need to check if the timer is already set! > > > > This should also work with boundary cases like the total recovered > > agents count being 0 or 1 (overlapping percentiles) etc. Right? We should comment on the reasons for dropping issues. > On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote: > > src/master/metrics.hpp > > Lines 51 (patched) > > <https://reviews.apache.org/r/68706/diff/5/?file=2098491#file2098491line51> > > > > On using Timer (e.g., like > > [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) > > vs. PushGauge, after looking at how it'll be used I think the main > > advantage of Timer is that it doesn't export any value if you haven't set > > it. > > > > Consider the two cases: > > > > 1. There are 1000 recovered agents and 0 have reregsitered, should all > > the timers have zero values or should they be absent? > > > > 2. There are 0 recovered agents (e.g., brand new cluster), should all > > of the metrics be zero or non-existent? I feel like they should be zero, as > > in, e.g., 100% of all 0 agents are reregistered within 0 secs. > > > > So timer handles this natrually. Also it sets the `_secs` name for you > > but that's a minor conveninence. We should comment on the reasons for dropping issues. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review209722 --- On Oct. 19, 2018, 4:56 p.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > --- > > (Updated Oct. 19, 2018, 4:56 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > --- > > During the master failover, the time that the master elected is > considered as the start of failover. In the progress of > reregistration, the percentile represents the time when such > percentile of agents finished registration again; The percentile of > these data as in this metrics can represent overall reregistration > progress; In case of degradation towards to the end of reregistration, > the high percentile can reflect it; In the case there are unreachable > agents in the failover, if certain percentile recovery couldn't be > reached, the intiail value of that percentile will not be updated. > > > Diffs > - > > src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc > src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 > > > Diff: https://reviews.apache.org/r/68706/diff/7/ > > > Testing > --- > > Automation: > [ RUN ] MasterTest.MetricsInMetricsEndpoint > [ OK ] MasterTest.MetricsInMetricsEndpoint (42 ms) > > Real world cases: > > While the master is not elected or there is no agents recovered yet > "master/recovered_agents_100_percent_reregistered_secs": 0.0, > "master/recovered_agents_25_percent_reregistered_secs": 0.0, > "master/recovered_agents_50_percent_reregistered_secs": 0.0, > "master/recovered_agents_75_percent_reregistered_secs": 0.0, > "master/recovered_agents_90_percent_reregistered_secs": 0.0, > "master/recovered_agents_99_percent_reregistered_secs": 0.0, > "master/slave_reregistrations": 0.0, > > While reregistrations is in progress: 5 out of 6 completed: > "master/recovered_agents_100_percent_reregistered_secs": 0.0, > "master/recovered_agents_25_percent_reregistered_secs": 2.0, > "master/recovered_agents_50_percent_reregistered_secs": 3.0, > "master/recovered_agents_75_percent_reregistered_secs": 6.0, > "master/recovere
Re: Review Request 68706: Added master failover reregistration progress metrics.
> On Oct. 18, 2018, 11:14 a.m., Jiang Yan Xu wrote: > > Sorry for the delay. Feel free to chat if it's not clear! - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review209722 --- On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > --- > > (Updated Oct. 16, 2018, 9:49 a.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > --- > > During the master failover, the time that the master elected is > considered as the start of failover. In the progress of > reregistration, the percentile represents the time when such > percentile of agents finished registration again; The percentile of > these data as in this metrics can represent overall reregistration > progress; In case of degradation towards to the end of reregistration, > the high percentile can reflect it; In the case there are unreachable > agents in the failover, if certain percentile recovery couldn't be > reached, the intiail value of that percentile will not be updated. > > > Diffs > - > > docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 > src/master/master.hpp ea7e9242b62fe6c2cc0e717f9a9f2f0c1cc0a390 > src/master/master.cpp 868787bb2f9d879531402f83507b322462322efc > src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 > src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 > src/tests/master_tests.cpp 1db8ed7d81acbcd8bad4b7ca77c501d1d99cc135 > > > Diff: https://reviews.apache.org/r/68706/diff/5/ > > > Testing > --- > > Automation: > [ RUN ] MasterTest.MetricsInMetricsEndpoint > [ OK ] MasterTest.MetricsInMetricsEndpoint (42 ms) > > Real world cases: > While reregistrations is in progress: 3277 out of 3582 completed: > "master/slave_reregistrations": 3277.0, > "master/slaves_100_percent_reregistered_secs": 0.0, > "master/slaves_25_percent_reregistered_secs": 5.0, > "master/slaves_50_percent_reregistered_secs": 11.0, > "master/slaves_75_percent_reregistered_secs": 20.0, > "master/slaves_90_percent_reregistered_secs": 30.0, > "master/slaves_99_percent_reregistered_secs": 0.0, > > > While 3582 reregistrations were all completed: > "master/slave_reregistrations": 3582.0, > "master/slaves_100_percent_reregistered_secs": 54.0, > "master/slaves_25_percent_reregistered_secs": 5.0, > "master/slaves_50_percent_reregistered_secs": 11.0, > "master/slaves_75_percent_reregistered_secs": 20.0, > "master/slaves_90_percent_reregistered_secs": 30.0, > "master/slaves_99_percent_reregistered_secs": 39.0, > > > Thanks, > > Xudong Ni > >
Re: Review Request 68706: Added master failover reregistration progress metrics.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review209722 --- src/master/master.hpp Lines 2346-2347 (patched) <https://reviews.apache.org/r/68706/#comment294277> I wasn't sure if we needed a dedicated struct / method for these metrics when we first discussed this approach but looking at the the additional method, math and class member state that we need to track I feel that we do. It's probably the best if we put such logic in the master `Metrics` class. It'll isolate the metric calculation logic from the master and you can use the struct's member state to help with tracking. Similar examples already exist here. e.g., https://github.com/apache/mesos/blob/f03f91545fec64a767f808c48a3b352dc86b1ac0/src/master/metrics.hpp#L89 I propose that we add something like this ``` struct AgentReregistrations { AgentReregistrations(int recoveredAgentCount); ~AgentReregistrations(); // If you use `Timer` you don't even need this. See comment about Timer below. process::Time electedTime, int recoveredAgentCount; int reregisteredAgentCount = 0; // ... Whatever else state you may need. process::metrics::PushGauge recovered_agents_25_percent_reregistered_secs; // ... other gauges. void incrementAgentReregistered(); }; // Option since you only set it after the master is elected. Option agentReregistrations; ``` src/master/master.cpp Lines 1866-1868 (patched) <https://reviews.apache.org/r/68706/#comment294282> But we naturally don't need float comparison, we are comparing the reregistered agents count (int) against the **minimum** number of agents that **satisfy** the percentage requirement (int). So if you take a `ceil` of the floating point multiplication result and cast to int you'll get the semantically correct number? e.g., if total = 5 and we are calculating 75% the `reregisteredAgentCount` we are looking for is 4. src/master/master.cpp Lines 1874-1875 (patched) <https://reviews.apache.org/r/68706/#comment294312> If we use equality operator and only set the timer when such a number of reregistered agent is reached we are guaranteed to only set each timer once (but we may need to set multiple timers in one call) right? This alleviates the need to check if the timer is already set! This should also work with boundary cases like the total recovered agents count being 0 or 1 (overlapping percentiles) etc. Right? src/master/metrics.hpp Lines 50 (patched) <https://reviews.apache.org/r/68706/#comment294275> Can we use the names agreed-upon with BenM in the slack channel which is e.g., `recovered_agents_50_percent_reregistered_secs`? src/master/metrics.hpp Lines 51 (patched) <https://reviews.apache.org/r/68706/#comment294313> On using Timer (e.g., like [state_fetch](https://github.com/apache/mesos/blob/7f36ebc1775398a43b2aa3a81bb647fb296b8313/src/master/registrar.cpp#L172)) vs. PushGauge, after looking at how it'll be used I think the main advantage of Timer is that it doesn't export any value if you haven't set it. Consider the two cases: 1. There are 1000 recovered agents and 0 have reregsitered, should all the timers have zero values or should they be absent? 2. There are 0 recovered agents (e.g., brand new cluster), should all of the metrics be zero or non-existent? I feel like they should be zero, as in, e.g., 100% of all 0 agents are reregistered within 0 secs. So timer handles this natrually. Also it sets the `_secs` name for you but that's a minor conveninence. - Jiang Yan Xu On Oct. 16, 2018, 9:49 a.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > --- > > (Updated Oct. 16, 2018, 9:49 a.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > --- > > During the master failover, the time that the master elected is > considered as the start of failover. In the progress of > reregistration, the percentile represents the time when such > percentile of agents finished registration again; The percentile of > these data as in this metrics can represent overall reregistration > progress; In case of degradation towards to the end of reregistration, > the high percentile can reflect it; In the case there are unreachable
Re: Review Request 68586: Add the output file to the hash on CommandInfo::URI.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68586/#review209710 --- include/mesos/type_utils.hpp Lines 532 (patched) <https://reviews.apache.org/r/68586/#comment294257> So `cache` is counted here. Does it mean that if I specify two duplicate URIs with different `cache` values one will overwrite another in fetching? - Jiang Yan Xu On Aug. 31, 2018, 9:47 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68586/ > --- > > (Updated Aug. 31, 2018, 9:47 a.m.) > > > Review request for mesos, Gilbert Song and Joseph Wu. > > > Bugs: MESOS-9172 > https://issues.apache.org/jira/browse/MESOS-9172 > > > Repository: mesos > > > Description > --- > > The `output_file` field of the `CommandInfo::URI` is not combined > into the hash value. This means that the fetcher does not consider two > messages that differ only by the `output_file` as different. By adding > the `output_file` to the hash it is possible for a task to fetch the > same URI to multiple, distinct ouptuts. > > > Diffs > - > > include/mesos/type_utils.hpp 19ea81716496bcc0117a1b0ff157a0374f38bbfa > include/mesos/v1/mesos.hpp fda3eb42061f820869a2d8da939fccadc4e5ddfb > > > Diff: https://reviews.apache.org/r/68586/diff/1/ > > > Testing > --- > > make check (Fedora 28) > > > Thanks, > > James Peach > >
Re: Review Request 68586: Add the output file to the hash on CommandInfo::URI.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68586/#review209706 --- Ship it! Ship It! - Jiang Yan Xu On Aug. 31, 2018, 9:47 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68586/ > --- > > (Updated Aug. 31, 2018, 9:47 a.m.) > > > Review request for mesos, Gilbert Song and Joseph Wu. > > > Bugs: MESOS-9172 > https://issues.apache.org/jira/browse/MESOS-9172 > > > Repository: mesos > > > Description > --- > > The `output_file` field of the `CommandInfo::URI` is not combined > into the hash value. This means that the fetcher does not consider two > messages that differ only by the `output_file` as different. By adding > the `output_file` to the hash it is possible for a task to fetch the > same URI to multiple, distinct ouptuts. > > > Diffs > - > > include/mesos/type_utils.hpp 19ea81716496bcc0117a1b0ff157a0374f38bbfa > include/mesos/v1/mesos.hpp fda3eb42061f820869a2d8da939fccadc4e5ddfb > > > Diff: https://reviews.apache.org/r/68586/diff/1/ > > > Testing > --- > > make check (Fedora 28) > > > Thanks, > > James Peach > >
Re: Review Request 68587: Fixed fetcher deadlock with duplicate URIs.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68587/#review209704 --- Fix it, then Ship it! src/slave/containerizer/fetcher.cpp Lines 409-411 (patched) <https://reviews.apache.org/r/68587/#comment294254> The crux here is that URIs with the same `value` but are different in other fields like `executable` or `extract` can lead to multiple items in `entries` pointing to the same cache entry. Add it to the comment? src/slave/containerizer/fetcher.cpp Lines 421 (patched) <https://reviews.apache.org/r/68587/#comment294252> Using `newEntries.at()` is more idiomatic even though we did check that the entry exists. - Jiang Yan Xu On Aug. 31, 2018, 9:47 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68587/ > --- > > (Updated Aug. 31, 2018, 9:47 a.m.) > > > Review request for mesos, Gilbert Song and Joseph Wu. > > > Bugs: MESOS-9172 > https://issues.apache.org/jira/browse/MESOS-9172 > > > Repository: mesos > > > Description > --- > > If a fetch request contains duplicate URIs that are not already > in the cache, the fetcher would erroneously expect that some other > fetch process is going to download that cache entry. It will then > wait for a Future that will never complete. > > The fix is to track whether the cache entry was created in this > fetch, and in that case to simply allow the duplicate URI. In > the fetcher, we check the cache before downloading so that a URIs > can be fetched to distinct output files without being downloaded > multiple times. > > > Diffs > - > > src/launcher/fetcher.cpp ef8d7ebdcf6d5d15459fd3392cb99774f078347a > src/slave/containerizer/fetcher.cpp > 17f5388200c8341936cb4d7f8da67b5f286b727d > src/tests/fetcher_tests.cpp f3ea7092635c88b9dddcc2998a4c7350fb56110c > > > Diff: https://reviews.apache.org/r/68587/diff/1/ > > > Testing > --- > > make check (Fedora 28) > > > Thanks, > > James Peach > >
Re: Review Request 68999: Fixed the XFS project ID labeling to not cross mount points.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68999/#review209500 --- Ship it! +1 on the issue you raised yourself as well. src/slave/containerizer/mesos/isolators/xfs/utils.cpp Lines 363 (patched) <https://reviews.apache.org/r/68999/#comment293968> I think to summarize the linked article here one key point worth mentioning is the "mount crossings" check applies even when the source and target are on the same filesystem (which `FTS_XDEV` doesn't give us)? - Jiang Yan Xu On Oct. 11, 2018, 3:23 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68999/ > --- > > (Updated Oct. 11, 2018, 3:23 p.m.) > > > Review request for mesos, Ilya Pronin, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-9300 > https://issues.apache.org/jira/browse/MESOS-9300 > > > Repository: mesos > > > Description > --- > > If we bind mount a directory on the same filesystem, the fts(3) > directory traversal that recursively assigns XFS project IDs isn't > able to detect when it crosses the mount point. This leads it to > erroneously rewrite the project IDs for persistent volumes if they > are bind mounted into the container by the `filesystem/linux` > isolator. The fix is to test whether we crossed a mount point > whenever we descend into a new directory so that we can avoid > assigning project ID beneath it. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/xfs/utils.cpp > d5d03774b39ff6fada4a56e565d8c754790c42db > src/tests/containerizer/xfs_quota_tests.cpp > a615ab125cd0144c5da9445468be2aaa0dc05507 > > > Diff: https://reviews.apache.org/r/68999/diff/1/ > > > Testing > --- > > sudo make check (Fedora 28) > > > Thanks, > > James Peach > >
Re: Review Request 68706: Added master failover reregistration progress metrics.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68706/#review208599 --- src/master/master.cpp Lines 7436 (patched) <https://reviews.apache.org/r/68706/#comment292676> There are a few considerations on using one timer for all agent reregistrations. 1. The full list of statistics only make sense when 100% of the agents are reregistered. See this example in the testing section: ``` "master/slaves_reregistration_secs":32.321583104, "master/slaves_reregistration_secs/count":7, "master/slaves_reregistration_secs/max":32.321583104, "master/slaves_reregistration_secs/min":3.35373696, "master/slaves_reregistration_secs/p50":8.774915072, "master/slaves_reregistration_secs/p90":30.8676036608, "master/slaves_reregistration_secs/p95":31.59459338236, "master/slaves_reregistration_secs/p99":32.176185159679996, "master/slaves_reregistration_secs/p999":32.307043309567995, "master/slaves_reregistration_secs/p":32.3201291245568, ``` What does /p50 mean when 200 of the 1000 total are reregistered? I think metrics always being meaningful is a requirement for continuous monitoring. 2. `TimeSeries` drops old values during [sparsification](https://github.com/apache/mesos/blob/91ca8e2b2071f7e4b89702ae7c807b074bdef31b/3rdparty/libprocess/include/process/timeseries.hpp#L191). Therefore if the total agent count in the cluster is bigger than the time series capacity (large Mesos deployments have O(1) agents, old values get dropped and the p25 is no longer "25% of the total agents" even after all agents are reregistered. This leads to the next point. 3. `Timer`'s semantics: it's suppose to work when each observation has equal semantics, e.g., each [state_store](http://mesos.apache.org/documentation/latest/monitoring/#registrar) has the same meaning, therefore it's OK to drop old values which may reduce precision but will not changing the meaning to the statistics. Let's start with how the monitoring system will use it and work backwards. I think my proposal in https://issues.apache.org/jira/browse/MESOS-9178 is worth considering but maybe there are better ones. - Jiang Yan Xu On Sept. 12, 2018, 2:42 p.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68706/ > ----------- > > (Updated Sept. 12, 2018, 2:42 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-9178 > https://issues.apache.org/jira/browse/MESOS-9178 > > > Repository: mesos > > > Description > --- > > When an agent is reregistrated, the time delta from that moment to > the master elected time was saved; In the progress of reregistration, > each data entry represents the registration time delta from master > elected time; The percentile of these data as in this metrics can > represent overall reregistration progress; In case of degradation > towards to the end of reregistration, the high percentile will > reflect it. > > Note: These metrics only represent the completed reregistration; It > does not monitor agents were finally marked as unreachable that the > reregistration didn't actually happen, the unreachable agents were > already monitored by existing metrics. > > > Diffs > - > > docs/monitoring.md 00c6ea94bcb73746aef740236632ede123f5b534 > src/master/master.cpp 06d769aeba16586a020729d454f4d00688b78c78 > src/master/metrics.hpp e1da18e6ba2737f729e1e30653020538150ae898 > src/master/metrics.cpp 56a7eef2d279ad3248092d37d19013d3ac110757 > > > Diff: https://reviews.apache.org/r/68706/diff/1/ > > > Testing > --- > > Tested in mmaster with seven reregistration agents: > "master/slaves_reregistration_secs":32.321583104, > "master/slaves_reregistration_secs/count":7, > "master/slaves_reregistration_secs/max":32.321583104, > "master/slaves_reregistration_secs/min":3.35373696, > "master/slaves_reregistration_secs/p50":8.774915072, > "master/slaves_reregistration_secs/p90":30.8676036608, > "master/slaves_reregistration_secs/p95":31.59459338236, > "master/slaves_reregistration_secs/p99":32.176185159679996, > "master/slaves_reregistration_secs/p999":32.307043309567995, > "master/slaves_reregistration_secs/p":32.3201291245568, > > > Thanks, > > Xudong Ni > >
Re: Review Request 68448: Skip the container if there's no container network.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/68448/#review207693 --- Ship it! Ship It! - Jiang Yan Xu On Aug. 21, 2018, 12:16 p.m., Sergey Urbanovich wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/68448/ > --- > > (Updated Aug. 21, 2018, 12:16 p.m.) > > > Review request for mesos, Jie Yu and Jiang Yan Xu. > > > Bugs: MESOS-9171 > https://issues.apache.org/jira/browse/MESOS-9171 > > > Repository: mesos > > > Description > --- > > If the container wants to join host network it won't have container > networks. In that case `rootDir` could be not configured and > `rootDir.get()` in `usage()` will cause mesos agent crash. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/network/cni/cni.cpp > 6df52a55b13675df593872559b5238e8a10fd666 > > > Diff: https://reviews.apache.org/r/68448/diff/1/ > > > Testing > --- > > bin/mesos-tests.sh --verbose > --gtest_filter="CniIsolatorTest.VETH_VerifyResourceStatistics" > --gtest_break_on_failure --gtest_repeat=100 > > > Thanks, > > Sergey Urbanovich > >
Re: Review Request 67365: Added MESOS-8340 to the 1.7.x CHANGELOG.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67365/#review204104 --- Ship it! Ship It! - Jiang Yan Xu On May 30, 2018, 11:57 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67365/ > --- > > (Updated May 30, 2018, 11:57 a.m.) > > > Review request for mesos, Xudong Ni and Jiang Yan Xu. > > > Bugs: MESOS-8340 > https://issues.apache.org/jira/browse/MESOS-8340 > > > Repository: mesos > > > Description > --- > > Added MESOS-8340 to the 1.7.x CHANGELOG. > > > Diffs > - > > CHANGELOG 52c7fc3bcd73a0eead2934644f636b85c60e0061 > docs/upgrades.md 1a1705cd9c38583c591392dd154d136fce04129d > > > Diff: https://reviews.apache.org/r/67365/diff/2/ > > > Testing > --- > > Manual inspection of the dev website. > > > Thanks, > > James Peach > >
Re: Review Request 67365: Added MESOS-8340 to the 1.7.x CHANGELOG.
> On May 30, 2018, 10:56 a.m., Jiang Yan Xu wrote: > > docs/upgrades.md > > Lines 445 (patched) > > <https://reviews.apache.org/r/67365/diff/1/?file=2031690#file2031690line445> > > > > Is it TCP specific? > > James Peach wrote: > Yes. I see and I found the word TCP in code comments. So apps are free to listening on unassigned UDP ports? This is fine but I am just a bit surprised this is the place in the documentation that TCP is mentioned (not in network-ports.md either). Worth clarifying (separately as this RR is about CHANGELOG)? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67365/#review204058 --- On May 30, 2018, 11:57 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67365/ > --- > > (Updated May 30, 2018, 11:57 a.m.) > > > Review request for mesos, Xudong Ni and Jiang Yan Xu. > > > Bugs: MESOS-8340 > https://issues.apache.org/jira/browse/MESOS-8340 > > > Repository: mesos > > > Description > --- > > Added MESOS-8340 to the 1.7.x CHANGELOG. > > > Diffs > - > > CHANGELOG 52c7fc3bcd73a0eead2934644f636b85c60e0061 > docs/upgrades.md 1a1705cd9c38583c591392dd154d136fce04129d > > > Diff: https://reviews.apache.org/r/67365/diff/2/ > > > Testing > --- > > Manual inspection of the dev website. > > > Thanks, > > James Peach > >
Re: Review Request 67365: Added MESOS-8340 to the 1.7.x CHANGELOG.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67365/#review204058 --- docs/upgrades.md Lines 445 (patched) <https://reviews.apache.org/r/67365/#comment286451> Is it TCP specific? - Jiang Yan Xu On May 29, 2018, 3:46 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/67365/ > --- > > (Updated May 29, 2018, 3:46 p.m.) > > > Review request for mesos, Xudong Ni and Jiang Yan Xu. > > > Bugs: MESOS-8340 > https://issues.apache.org/jira/browse/MESOS-8340 > > > Repository: mesos > > > Description > --- > > Added MESOS-8340 to the 1.7.x CHANGELOG. > > > Diffs > - > > CHANGELOG 52c7fc3bcd73a0eead2934644f636b85c60e0061 > docs/upgrades.md 1a1705cd9c38583c591392dd154d136fce04129d > > > Diff: https://reviews.apache.org/r/67365/diff/1/ > > > Testing > --- > > Manual inspection of the dev website. > > > Thanks, > > James Peach > >
Review Request 67115: Added more diagnostic info for a CHECK.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/67115/ --- Review request for mesos and James Peach. Repository: mesos Description --- Added more diagnostic info for a CHECK. Diffs - src/master/allocator/mesos/hierarchical.cpp 1000968be6a2935a4cac571414d7f06d7df7acf0 Diff: https://reviews.apache.org/r/67115/diff/1/ Testing --- make check Thanks, Jiang Yan Xu
Re: Review Request 66919: Failure to update registry should abort the master process.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66919/#review202606 --- Fix it, then Ship it! FYI committed with some fixes on the commit message (which comes from the review summary). src/master/http.cpp Lines 4163 (patched) <https://reviews.apache.org/r/66919/#comment284492> You missed a space between `result)` and `{` which I didn't catch initially but fixed up in a subsequent commit. - Jiang Yan Xu On May 7, 2018, 11:10 a.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66919/ > --- > > (Updated May 7, 2018, 11:10 a.m.) > > > Review request for mesos, Joseph Wu and Jiang Yan Xu. > > > Bugs: MESOS-8630 > https://issues.apache.org/jira/browse/MESOS-8630 > > > Repository: mesos > > > Description > --- > > When the registrar fails to update the registry it would abort the > actor and fail all future operations,However when the registrar > updates is requested by an operator API such as maintenance update, > the master process doesn't shut down (a 500 error is returned to the > client instead)and all subsequent operations will fail. > > Review: https://reviews.apache.org/r/66919 > > > Diffs > - > > src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad > > > Diff: https://reviews.apache.org/r/66919/diff/3/ > > > Testing > --- > > xujyan made a local test change to verify this patch: > https://github.com/xujyan/mesos/commit/68051320a87431f6d2f3fbad6b0b97814200a731 > > Test output: > F0504 20:43:36.530680 1858991 http.cpp:4304] CHECK_READY(result): is FAILED: > Failed to update registry: Failed to perform store within 5secs > > > Thanks, > > Xudong Ni > >
Re: Review Request 66919: Failure to update registry should abort the master process.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66919/#review202496 --- The "process aborting" logic is a bit hard to write tests for but we can test it manually without checking in the code. e.g., I changed one test a bit to show the check failure: https://github.com/xujyan/mesos/commit/68051320a87431f6d2f3fbad6b0b97814200a731 Could you update the "Testing Done" section with the link? That should cover it. :) - Jiang Yan Xu On May 4, 2018, 12:46 p.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66919/ > --- > > (Updated May 4, 2018, 12:46 p.m.) > > > Review request for mesos, Joseph Wu and Jiang Yan Xu. > > > Bugs: MESOS-8630 > https://issues.apache.org/jira/browse/MESOS-8630 > > > Repository: mesos > > > Description > --- > > When the registrar fails to update the registry it would abort the > actor and fail all future operations,However when the registrar > updates is requested by an operator API such as maintenance update, > the master process doesn't shut down (a 500 error is returned to the > client instead)and all subsequent operations will fail. > > Review: https://reviews.apache.org/r/66919 > > > Diffs > - > > src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad > > > Diff: https://reviews.apache.org/r/66919/diff/1/ > > > Testing > --- > > No error found by running all unit tests(make check) > > > Thanks, > > Xudong Ni > >
Re: Review Request 66919: Failure to update registry should abort the master process.
> On May 4, 2018, 8:19 a.m., Jiang Yan Xu wrote: > > src/master/http.cpp > > Lines 4163-4165 (patched) > > <https://reviews.apache.org/r/66919/diff/1/?file=2016039#file2016039line4163> > > > > The following will be more idiomatic. > > > > ``` > > .onAny([](const Future& result) { > > CHECK_READY(result); > > }) > > ``` > > > > Note the spaces and const ref. > > > > > > It'll be helpful to get familiarize with > > > > https://github.com/apache/mesos/blob/master/docs/c%2B%2B-style-guide.md > > > > which in turn references > > > > https://google.github.io/styleguide/cppguide.html > > Jiang Yan Xu wrote: > It's also helpful to attach a message to the check failure: > > > ``` > CHECK_READY(result) > << "Failed to update maintenance schedule in the registry"; > ``` Also let's add a TODO above the `CHECK_READY`: ``` TODO(fiu): Consider changing/refactoring the registrar itself so the individual call sites don't need to handle this separately. All registrar failures that cause it to abort should instead abort the process. ``` - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66919/#review202416 --- On May 4, 2018, 12:46 p.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66919/ > --- > > (Updated May 4, 2018, 12:46 p.m.) > > > Review request for mesos, Joseph Wu and Jiang Yan Xu. > > > Bugs: MESOS-8630 > https://issues.apache.org/jira/browse/MESOS-8630 > > > Repository: mesos > > > Description > --- > > When the registrar fails to update the registry it would abort the > actor and fail all future operations,However when the registrar > updates is requested by an operator API such as maintenance update, > the master process doesn't shut down (a 500 error is returned to the > client instead)and all subsequent operations will fail. > > Review: https://reviews.apache.org/r/66919 > > > Diffs > - > > src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad > > > Diff: https://reviews.apache.org/r/66919/diff/1/ > > > Testing > --- > > No error found by running all unit tests(make check) > > > Thanks, > > Xudong Ni > >
Re: Review Request 66919: Failure to update registry should abort the master process.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66919/#review202416 --- LGTM It's hard to write a test for this but we can probably modify a test in master_maintenance_tests.cpp to observe the check failure. Let's also check Joseph with the comment here. https://github.com/apache/mesos/blob/520b729857223aeade345cbdf61209ec4f395ad9/src/master/maintenance.hpp#L39 src/master/http.cpp Lines 4163-4165 (patched) <https://reviews.apache.org/r/66919/#comment284233> The following will be more idiomatic. ``` .onAny([](const Future& result) { CHECK_READY(result); }) ``` Note the spaces and const ref. It'll be helpful to get familiarize with https://github.com/apache/mesos/blob/master/docs/c%2B%2B-style-guide.md which in turn references https://google.github.io/styleguide/cppguide.html - Jiang Yan Xu On May 3, 2018, 9:43 a.m., Xudong Ni wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66919/ > --- > > (Updated May 3, 2018, 9:43 a.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: MESOS-8630 > https://issues.apache.org/jira/browse/MESOS-8630 > > > Repository: mesos > > > Description > --- > > When the registrar fails to update the registry it would abort the > actor and fail all future operations,However when the registrar > updates is requested by an operator API such as maintenance update, > the master process doesn't shut down (a 500 error is returned to the > client instead)and all subsequent operations will fail. > > Review: https://reviews.apache.org/r/66919 > > > Diffs > - > > src/master/http.cpp 135ae4389623385a6638158f5f23d6daca14a0ad > > > Diff: https://reviews.apache.org/r/66919/diff/1/ > > > Testing > --- > > No error found by running all unit tests(make check) > > > Thanks, > > Xudong Ni > >
Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66644/#review202368 --- Fix it, then Ship it! I'll commit with these minor fixes. src/master/master.hpp Lines 1928 (patched) <https://reviews.apache.org/r/66644/#comment284232> I'll quote the `unreachable` as it is code. src/master/master.cpp Lines 7131 (patched) <https://reviews.apache.org/r/66644/#comment284170> Space in between `foreachkey (` src/master/master.cpp Lines 7135 (patched) <https://reviews.apache.org/r/66644/#comment284171> Space in between `foreach (` src/tests/partition_tests.cpp Lines 168-169 (patched) <https://reviews.apache.org/r/66644/#comment284231> This clearly can fit in the previous line which would be less jagged. src/tests/partition_tests.cpp Lines 396 (patched) <https://reviews.apache.org/r/66644/#comment284230> Period to end the sentence. - Jiang Yan Xu On May 1, 2018, 10:18 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66644/ > --- > > (Updated May 1, 2018, 10:18 p.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: 8750 > https://issues.apache.org/jira/browse/8750 > > > Repository: mesos > > > Description > --- > > A RunTask messsage could get dropped for an agent while its > disconnected from the master and when such an agent goes unreachable > then this dropped task message gets added to the unreachable tasks. > When the agent reregisters, the master sends status updates for the > tasks that the agent reported when re-registering and these tasks are > also removed from the unreachableTasks on the framework but since the > agent doesn't know about the dropped task so it doesn't get removed > from the unreachableTasks leading to a check failure when > this inconsistency is detected during framework removal. > > > Diffs > - > > src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 > src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a > src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc > > > Diff: https://reviews.apache.org/r/66644/diff/5/ > > > Testing > --- > > make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66644/#review202215 --- LGTM sans the unresolved issue from the previous revision! src/tests/partition_tests.cpp Lines 167-169 (original), 167-170 (patched) <https://reviews.apache.org/r/66644/#comment283966> This reads like "the `RunTaskMessage` is remove correctly" but I think the subject is the `task`? Perhaps revise a bit so it says ``` // This test verifies that if a `RunTaskMessage` dropped en route to // an agent which later becomes unreachable, the task is removed correctly // from the master's unreachable task records when the agent // reregisters. ``` ? src/tests/partition_tests.cpp Lines 171 (patched) <https://reviews.apache.org/r/66644/#comment283967> This seems to be exceeded the 80 character limit. src/tests/partition_tests.cpp Lines 47 (patched) <https://reviews.apache.org/r/66644/#comment283968> It doesn't look like this is used? - Jiang Yan Xu On April 25, 2018, 8:46 a.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66644/ > --- > > (Updated April 25, 2018, 8:46 a.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: 8750 > https://issues.apache.org/jira/browse/8750 > > > Repository: mesos > > > Description > --- > > A RunTask messsage could get dropped for an agent while it's > disconnected from the master and when such an agent goes unreachable > then this dropped task gets added to the unreachable tasks. > When the agent reregisters, tasks reported by it are removed from the > unreachableTasks bookkeeping on the master but since the > agent doesn't know about the dropped task so it doesn't get removed > from the unreachableTasks leading to a master check failure when > this inconsistency is detected during framework removal. > > > Diffs > - > > src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 > src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a > src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc > > > Diff: https://reviews.apache.org/r/66644/diff/4/ > > > Testing > --- > > make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 66644: Remove unknown unreachable tasks when agent reregisters.
> On April 25, 2018, 3:39 p.m., Jiang Yan Xu wrote: > > src/master/master.cpp > > Lines 10577-10584 (patched) > > <https://reviews.apache.org/r/66644/diff/3/?file=2012148#file2012148line10577> > > > > This is the case we don't need to use `at()` as we are modifying the > > hashmap, we are not concerned about adding empty entries. We can just use > > `[]`: > > http://www.cplusplus.com/reference/unordered_map/unordered_map/operator[]/ > > > > The following should work for all the cases here. > > > > ``` > > slaves.unreachableTasks[slave->id].put(task->framework_id(), > > task->task_id()); > > ``` > > Megha Sharma wrote: > I agree we can just use the `unordered_map::operator[]` without the empty > entry adding concern but `unordered_map::at` here has the same behavior. I > don't mind changing it but just curious here as to why we prefer the > `operator[]` in this case? Because the 8 lines can be condensed into 1-2? It's not about `[]` vs. `at` per se but the whole condition checking. I'd say the `[]` behavior is precesely created for this case so we'd better use it. :) - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66644/#review201963 --- On April 25, 2018, 8:46 a.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/66644/ > ----------- > > (Updated April 25, 2018, 8:46 a.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Bugs: 8750 > https://issues.apache.org/jira/browse/8750 > > > Repository: mesos > > > Description > --- > > A RunTask messsage could get dropped for an agent while it's > disconnected from the master and when such an agent goes unreachable > then this dropped task gets added to the unreachable tasks. > When the agent reregisters, tasks reported by it are removed from the > unreachableTasks bookkeeping on the master but since the > agent doesn't know about the dropped task so it doesn't get removed > from the unreachableTasks leading to a master check failure when > this inconsistency is detected during framework removal. > > > Diffs > - > > src/master/master.hpp 0d9620dd0c232dc1df83477e838eeb7313bf8828 > src/master/master.cpp 767ad8cfe142b47ef07172bcb2a4fb49fc3e833a > src/tests/partition_tests.cpp 9138e5c745cf354a3573e1ab0b251d46702833cc > > > Diff: https://reviews.apache.org/r/66644/diff/4/ > > > Testing > --- > > make check > > > Thanks, > > Megha Sharma > >
Review Request 66778: Fixed Master which generated updates with the latest task state.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66778/ --- Review request for mesos, Andrei Budnik, Alexander Rukletsov, and Megha Sharma. Bugs: MESOS-8824 https://issues.apache.org/jira/browse/MESOS-8824 Repository: mesos Description --- The master should use `Task.status_update_state` to generate status updates during agent reregistration instead of `Task.state` because using the latter could result in the scheduler seeing out-of-order task state transitions (e.g., TASK_FINISHED -> TASK_RUNNING). Diffs - src/master/master.cpp c723a291ed8d219ced4492bc905ac6b52683ae47 src/tests/partition_tests.cpp 606025a54dc54a777222c4dca5fd8c3c084031d2 Diff: https://reviews.apache.org/r/66778/diff/1/ Testing --- Make check. Thanks, Jiang Yan Xu
Review Request 66769: Fixed flaky ReconciliationTest.ReconcileStatusUpdateTaskState.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/66769/ --- Review request for mesos, Andrei Budnik, Alexander Rukletsov, and Megha Sharma. Bugs: MESOS-8618 https://issues.apache.org/jira/browse/MESOS-8618 Repository: mesos Description --- To simulate a master failover we need to use `replicated_log` as the registry otherwiser the master loses persisted info about the agents. Diffs - src/tests/reconciliation_tests.cpp b06244b5f3efad3a88e367cd8e26cebd3d9f2e43 Diff: https://reviews.apache.org/r/66769/diff/1/ Testing --- Make check. Thanks, Jiang Yan Xu
Re: Review Request 65967: Eliminated some copying in the master's Accept call handler path.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65967/#review198911 --- Ship it! Sweet! Would it make sense to group JIRAs into a "copy elimination" epic? Completing a sweep is a large scale effort so an Epic with subtasks helps tracking remaining work? - Jiang Yan Xu On March 7, 2018, 7:52 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65967/ > --- > > (Updated March 7, 2018, 7:52 p.m.) > > > Review request for mesos and Meng Zhu. > > > Bugs: MESOS-8628 > https://issues.apache.org/jira/browse/MESOS-8628 > > > Repository: mesos > > > Description > --- > > This eliminates two copies of the `Accept` message. > > > Diffs > - > > src/master/http.cpp acf10a0f396234ec9f444c9b41515d657379d6ad > src/master/master.hpp c9c8a968b6f56fe261ac9fe374e926a28d40fccb > src/master/master.cpp e666b664dd125317cda5d16285d444b9c21e1f35 > > > Diff: https://reviews.apache.org/r/65967/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Benjamin Mahler > >
Review Request 65438: Fixed flaky SlaveTest.RegisteredAgentReregisterAfterFailover.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65438/ --- Review request for mesos, Alexander Rukletsov and James Peach. Bugs: MESOS-8232 https://issues.apache.org/jira/browse/MESOS-8232 Repository: mesos Description --- Previously due to a race condition a registration retry could be held in the authorizer and forwarded back to the master only after the agent's disconnection is detected, which leads to an remove registrar operation when the test asserts there's none. Diffs - src/tests/slave_tests.cpp 20b874481d3818574731fc30ba9df1fc2bcbe900 Diff: https://reviews.apache.org/r/65438/diff/1/ Testing --- Ran the test with 1000 iterations. Thanks, Jiang Yan Xu
Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64940/#review195350 --- Ship it! Will commit with a few touches like the following. src/master/master.cpp Lines 10937-10941 (original), 10935-10939 (patched) <https://reviews.apache.org/r/64940/#comment274514> Given our latest conclusion let's revert this back to ``` count += framework->unreachableTasks.size(); ``` - Jiang Yan Xu On Jan. 12, 2018, 11:24 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64940/ > --- > > (Updated Jan. 12, 2018, 11:24 a.m.) > > > Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod > Kone, and Jiang Yan Xu. > > > Bugs: MESOS-8337 > https://issues.apache.org/jira/browse/MESOS-8337 > > > Repository: mesos > > > Description > --- > > If an agent is lost, we try to remove all the tasks that might have > been lost. If a task is already terminal but has unacknowleged status > updates, it is expected that we track it in the unreachable tasks list > so we should remove the CHECK that prevents this. This also backs out > changes to how unreachable tasks are presented in the HTTP endpoints to > restore compatibility with previous Mesos releases. > > > Diffs > - > > src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 > src/master/master.hpp f915c6f611eaffd5c2cb6cba6c43def679e8f92d > src/master/master.cpp c96cd7090875fbf1b11b1708390e88182c77655b > src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 > src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a > > > Diff: https://reviews.apache.org/r/64940/diff/7/ > > > Testing > --- > > make check (Fedora 27) > > > Thanks, > > James Peach > >
Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64940/#review195241 --- LGTM modulo Vinod's comments. Sorry we have gone back and forth on this but this time I think we are close to reach what's shippable to unblock 1.5. src/master/http.cpp Lines 360-362 (original) <https://reviews.apache.org/r/64940/#comment274470> So, as reminded by Vinod, for backward compatibility let's keep the checks. However here we should change the condition to ``` if (task->state() == TASK_UNREACHABLE) { continue; } ``` because other terminal tasks that are not TASK_LOST could also be in this map... src/master/master.hpp Lines 2326-2327 (original) <https://reviews.apache.org/r/64940/#comment274417> The following CHECK looks correct? ``` CHECK(Master::isRemovable(task->state()) << task->state(); ``` Although failing CHECKs are annoying, it helped us uncover the inconsistency in our task state transitions so it's better to correct it? src/master/master.cpp Lines 11023-11036 (original), 11021-11034 (patched) <https://reviews.apache.org/r/64940/#comment274415> I was going to suggest that we remove the `if (task->state() == TASK_UNREACHABLE)` check but per Vinod's comment we shouldn't. src/tests/mesos.hpp Lines 3511 (patched) <https://reviews.apache.org/r/64940/#comment274471> s/a/an/? src/tests/mesos.hpp Line 3511 (original), 3518 (patched) <https://reviews.apache.org/r/64940/#comment274472> s/a/an/? src/tests/partition_tests.cpp Lines 2401 (patched) <https://reviews.apache.org/r/64940/#comment274477> We have to s/unreachable/completed/ and s/but/and/ ? src/tests/partition_tests.cpp Lines 2402 (patched) <https://reviews.apache.org/r/64940/#comment274478> With all the discussions about teminology it seems s/Completed/Terminal/ is more accurate? src/tests/partition_tests.cpp Lines 2440-2442 (patched) <https://reviews.apache.org/r/64940/#comment274473> Nit: declare them in the order we are expecting on them? i.e., starting -> running -> finished. src/tests/partition_tests.cpp Lines 2474-2475 (patched) <https://reviews.apache.org/r/64940/#comment274474> Without a paused clock, there is a chance of race here right? If the same status update is resent after a timeout before we acknowledge, it's possible that the resent update is going to be caught by the 2nd expectation? Would it be possible to hoist the Clock::pause() statement? src/tests/partition_tests.cpp Lines 2527 (patched) <https://reviews.apache.org/r/64940/#comment274475> Remove this debugging log. src/tests/partition_tests.cpp Lines 2532-2534 (patched) <https://reviews.apache.org/r/64940/#comment274476> Sorry now you have to expect it to be in the completed_task section... - Jiang Yan Xu On Jan. 5, 2018, 11:37 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64940/ > --- > > (Updated Jan. 5, 2018, 11:37 a.m.) > > > Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod > Kone, and Jiang Yan Xu. > > > Bugs: MESOS-8337 > https://issues.apache.org/jira/browse/MESOS-8337 > > > Repository: mesos > > > Description > --- > > If an agent is lost, we try to remove all the tasks that might have > been lost. If a task is already terminal but has unacknowleged status > updates, it is expected that we track it in the unreachable tasks list > so we should remove the CHECK that prevents this. This also backs out > changes to how unreachable tasks are presented in the HTTP endpoints to > restore compatibility with previous Mesos releases. > > > Diffs > - > > src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 > src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 > src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 > src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 > src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a > > > Diff: https://reviews.apache.org/r/64940/diff/6/ > > > Testing > --- > > make check (Fedora 27) > > > Thanks, > > James Peach > >
Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
> On Jan. 11, 2018, 6:24 p.m., Vinod Kone wrote: > > AFAICT, in 1.4.x we show unreachable terminal tasks on a removed agent as > > completed tasks. But now, we show them as unreachable tasks. If that's true > > it's an API change that needs to be called out. Is that really backwards > > compatible? Yeah it's true. Despite it being a bug that if the unreachable terminal tasks are considered completed and added to the completed list, it cannot be later removed when the agent reregisters and duplicates are shown in the webUI and APIs, it is indeed what 1.4.x gives you. 1.5 fixes the duplication problem but we did the extra work (the additional `if (task->state() != TASK_UNREACHABLE)` checks we added and this revision removes) to make it look like the 1.4.x version, I guess it's fine to keep it that way until we have a plan for an overhaul... So, probably let's not revert the code that involves the http endpoints (sorry for suggesting it earlier, there are small changes needed which I'll comment on). - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64940/#review195282 --- On Jan. 5, 2018, 11:37 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64940/ > --- > > (Updated Jan. 5, 2018, 11:37 a.m.) > > > Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod > Kone, and Jiang Yan Xu. > > > Bugs: MESOS-8337 > https://issues.apache.org/jira/browse/MESOS-8337 > > > Repository: mesos > > > Description > --- > > If an agent is lost, we try to remove all the tasks that might have > been lost. If a task is already terminal but has unacknowleged status > updates, it is expected that we track it in the unreachable tasks list > so we should remove the CHECK that prevents this. This also backs out > changes to how unreachable tasks are presented in the HTTP endpoints to > restore compatibility with previous Mesos releases. > > > Diffs > - > > src/master/http.cpp bc29fafb1f4b515aec3b77850f960c88a65c8362 > src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 > src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 > src/tests/mesos.hpp 93913f2e01898c73e09de58a975aa467e714d882 > src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a > > > Diff: https://reviews.apache.org/r/64940/diff/6/ > > > Testing > --- > > make check (Fedora 27) > > > Thanks, > > James Peach > >
Re: Review Request 64516: Improved documentation on resource reservations.
> On Jan. 9, 2018, 8:27 a.m., James Peach wrote: > > docs/reservation.md > > Lines 88 (patched) > > <https://reviews.apache.org/r/64516/diff/1/?file=1912814#file1912814line98> > > > > "Similarly, agents can register ..." Meaning s/by default// because of repetition? It doesn't feel too redundant to me because I wanted to emphasize the default behavior? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64516/#review195041 ----------- On Jan. 10, 2018, 2:23 p.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64516/ > --- > > (Updated Jan. 10, 2018, 2:23 p.m.) > > > Review request for mesos, Alexander Rojas, James Peach, and Michael Park. > > > Bugs: MESOS-8306 > https://issues.apache.org/jira/browse/MESOS-8306 > > > Repository: mesos > > > Description > --- > > Added explanation for authorizations on static reservations. > > > Diffs > - > > docs/reservation.md fa1f309f2b770e37964af2d5599519875195dbec > > > Diff: https://reviews.apache.org/r/64516/diff/2/ > > > Testing > --- > > Used a markdown viewer. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 64516: Improved documentation on resource reservations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64516/ --- (Updated Jan. 10, 2018, 2:23 p.m.) Review request for mesos, Alexander Rojas, James Peach, and Michael Park. Changes --- Comments. Bugs: MESOS-8306 https://issues.apache.org/jira/browse/MESOS-8306 Repository: mesos Description --- Added explanation for authorizations on static reservations. Diffs (updated) - docs/reservation.md fa1f309f2b770e37964af2d5599519875195dbec Diff: https://reviews.apache.org/r/64516/diff/2/ Changes: https://reviews.apache.org/r/64516/diff/1-2/ Testing --- Used a markdown viewer. Thanks, Jiang Yan Xu
Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64515/ --- (Updated Jan. 10, 2018, 2:22 p.m.) Review request for mesos, Alexander Rojas and James Peach. Changes --- Review comments. Bugs: MESOS-8306 https://issues.apache.org/jira/browse/MESOS-8306 Repository: mesos Description --- Used `reserve_resources` ACL for static reservations. Diffs (updated) - include/mesos/authorizer/acls.proto d869820491f1b81f81e1157493aa73e32735b9c3 src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 src/tests/master_authorization_tests.cpp 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 Diff: https://reviews.apache.org/r/64515/diff/4/ Changes: https://reviews.apache.org/r/64515/diff/3-4/ Testing --- make check. Thanks, Jiang Yan Xu
Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.
> On Jan. 9, 2018, 8:45 a.m., James Peach wrote: > > What do you think about testing the re-registration case? There's no > > coverage here, but equally, it would be annoying to duplicate all these > > tests for re-registration ... Two other tests `AuthorizedToRegisterAndReregisterAgent` and `UnauthorizedToReregisterAgent` have covered the reregister case (both success and failure), even though the condition in these test differ slightly in that they don't include roles, I feel that they share the common logic around the handling the authorization success and failure of reregistering agents so they are adequate? > On Jan. 9, 2018, 8:45 a.m., James Peach wrote: > > src/tests/master_authorization_tests.cpp > > Lines 2516 (patched) > > <https://reviews.apache.org/r/64515/diff/3/?file=1929962#file1929962line2516> > > > > What happens with the "*" role? Is that covered by > > "AuthorizedToRegisterNoStaticReservations"? Yes. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64515/#review195042 ----------- On Jan. 10, 2018, 2:22 p.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64515/ > --- > > (Updated Jan. 10, 2018, 2:22 p.m.) > > > Review request for mesos, Alexander Rojas and James Peach. > > > Bugs: MESOS-8306 > https://issues.apache.org/jira/browse/MESOS-8306 > > > Repository: mesos > > > Description > --- > > Used `reserve_resources` ACL for static reservations. > > > Diffs > - > > include/mesos/authorizer/acls.proto > d869820491f1b81f81e1157493aa73e32735b9c3 > src/master/master.hpp 5e6ba53c075174a1e514a395ceb17c26201ec470 > src/master/master.cpp 6fc5de89e54ba0b9ae2c4fb475be9878910820d3 > src/tests/master_authorization_tests.cpp > 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 > > > Diff: https://reviews.apache.org/r/64515/diff/4/ > > > Testing > --- > > make check. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
> On Jan. 4, 2018, 5:25 p.m., Vinod Kone wrote: > > src/master/master.cpp > > Lines 10037-10056 (original), 10039-10062 (patched) > > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10039> > > > > I think we shouldn't create a TASK_UNREACHABLE status update and call > > `updateTask` or `forward` for a terminal task at all. . Also, `forward` > > sends TASK_UNREACHABLE update for terminal task to the framework which > > looks incorrect. > > > > > > Ideally, we want terminal but unacknowledged tasks to still be marked > > unreachable in some way, either via task state being TASK_UNREACHABLE or > > task being present in `unreachableTasks`. This allows, for example, the > > WebUI to not show sandbox links for unreachable tasks irrespective of > > whether they were terminal or not before going unreachable. > > > > But doing this is tricky for various reasons: > > > > --> `updateTask()` doesn't allow a terminal state to be transitioned to > > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, > > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to > > operator API stream subscribers which looks incorrect. The fact that > > `updateTask` internally deals with already terminal tasks is a bad design > > decision in retrospect. I think the callers shouldn't call it for terminal > > tasks instead. > > > > --> It's not clear to our users what a `completed` task means. The > > intention was for this to hold a cache of terminal and acknowledged tasks > > for storing recent history. The users of the WebUI probably equate > > "Completed Tasks" to terminal tasks irrespective of their acknowledgement > > status, which is why it is confusing for them to see terminal but > > unacknowledged tasks in the "Active tasks" section in the WebUI. > > > > --> When a framework reconciles the state of a task on an unreachable > > agent, master replies with TASK_UNREACHABLE irrespective of whether the > > task was in a non-terminal state or terminal but un-acknowledged state or > > terminal and acknowledged state when the agent went unreachable. > > > > I think the direction we want to go towards is > > > > --> Completed tasks should consist of terminal unacknowledged and > > terminal acknowled tasks, likely in two different data structures. > > --> Unreachable tasks should consist of all non-complete tasks on an > > unreachable agent. All the tasks in this map should be in TASK_UNREACHABLE > > state. > > > > > > Given all the above is a very involved change, I would recommend > > keeping what you have here but with a giant TODO (your current comment in > > #10058 doesn't go into enough detail about the complexity here) that talks > > about the above stuff. Your change at least keeps the parity with the > > (broken) semantics that we have in 1.4 and earlier so that's a bit better. > > Vinod Kone wrote: > Ignore the first line. Forgot to delete it. > > Jiang Yan Xu wrote: > Future direction > > 1. If completed == terminal unacknowledged + terminal acknowledged, then > completed == terminal right? Should we then unify the terminology and pick > one? > 2. Unreachable tasks == non-terminal tasks on an unreachable agent: this > is what this RR is going to do but IIUC you want a different behavior. > > Current semantics > > 1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the > operator API subsribers for terminal tasks), as it stands right now we are > going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change > that? > 2. You also said above that "Ideally, we want terminal but unacknowledged > tasks to still be marked unreachable in some way" which seems to contradict > your later point that "Unreachable tasks should consist of all non-complete > (terminal) tasks", could you clarify? > > Overall it sounds to me that the most correct semantic is still to set > `TASK_UNREACHABLE` only for the tasks that are non-terminal (because > otherwise we know that the state is not going to change to something else > that we don't know yet) but perhaps we can use another field in the status to > signal the fact that the agent is partitioned? > > James Peach wrote: > https://issues.apache.org/jira/browse/MESOS-8405 > > Vinod Kone wrote: > Sorry for the confusion. In the future direct
Re: Review Request 64968: Removed the inaccurate `slaves.transitioning` function in the master.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64968/#review195094 --- Ship it! src/master/master.cpp Lines 8507 (patched) <https://reviews.apache.org/r/64968/#comment274241> Perhaps it'll be more informative to log the `slaveId` if it is present? i.e., distinguish the two sub-cases in logging? - Jiang Yan Xu On Jan. 4, 2018, 5:38 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64968/ > --- > > (Updated Jan. 4, 2018, 5:38 p.m.) > > > Review request for mesos, Vinod Kone and Jiang Yan Xu. > > > Bugs: MESOS-8390 > https://issues.apache.org/jira/browse/MESOS-8390 > > > Repository: mesos > > > Description > --- > > This function used to return whether the agent was transitioning > between states, however it was updated in e510813f93e253480005ce > to return only when the agent is recovered. > > This patch removes the now confusing function in favor of just > directly checking the state we care about in the reconciliation > logic. Since there were no other usages, the function is removed. > > > Diffs > - > > src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 > src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 > > > Diff: https://reviews.apache.org/r/64968/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 65050: Added some defensive checks for persistent volume sharing check.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65050/#review195075 --- Ship it! src/slave/containerizer/mesos/isolators/filesystem/linux.cpp Lines 604-605 (original), 611-612 (patched) <https://reviews.apache.org/r/65050/#comment274214> The code looks correct but the review summary seems not accurate: we are more benefitting from the fact that in these isolators we set `info->resources` only after we update the volumes on disk. Docker containerizer has the same `if (current.contains(resource))` check but that check doesn't work because we set `container->resources` first there. - Jiang Yan Xu On Jan. 9, 2018, 12:44 p.m., Jie Yu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65050/ > --- > > (Updated Jan. 9, 2018, 12:44 p.m.) > > > Review request for mesos, Gilbert Song, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-8356 > https://issues.apache.org/jira/browse/MESOS-8356 > > > Repository: mesos > > > Description > --- > > This patch adds some defensive checks for detecting if a persistent > volume is shared by multiple containers. It's currently not a bug > because of a `contains` check above in the loop body. But this is a bit > fragile and should be avoided. > > See more details in MESOS-8356. > > > Diffs > - > > src/slave/containerizer/mesos/isolators/filesystem/linux.cpp > 5200182970fa978e940d3f0e9cddce37608e84f7 > src/slave/containerizer/mesos/isolators/filesystem/posix.cpp > 6b9f4bb42621eec3eb648712fbfa41d94f253895 > > > Diff: https://reviews.apache.org/r/65050/diff/1/ > > > Testing > --- > > sudo make check > > > Thanks, > > Jie Yu > >
Re: Review Request 65049: Fixed the persistent volume permission issue in DockerContainerizer.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/65049/#review195068 --- Ship it! Ship It! - Jiang Yan Xu On Jan. 9, 2018, 11:31 a.m., Jie Yu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/65049/ > --- > > (Updated Jan. 9, 2018, 11:31 a.m.) > > > Review request for mesos, Gilbert Song, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-8356 > https://issues.apache.org/jira/browse/MESOS-8356 > > > Repository: mesos > > > Description > --- > > This patch fixes MESOS-8356 by skipping the current container to be > launched when doing the shared volume check (`isVolumeInUse`). Prior to > this patch, the code is buggy because `isVolumeInUse` will always be set > to `true`. > > > Diffs > - > > src/slave/containerizer/docker.cpp 890cb2aba44fe76e891117833eac8ccca00b759b > > > Diff: https://reviews.apache.org/r/65049/diff/1/ > > > Testing > --- > > sudo make check > > > Thanks, > > Jie Yu > >
Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
> On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote: > > src/master/master.cpp > > Lines 10061 (patched) > > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10062> > > > > Perhaps move all the logic around determining the unreachable task down > > here. > > > > ``` > > // We need to inform `removeTask` whether the task became unreachable > > due to the agent being removed. Note that terminal tasks are not considered > > unreachable. We can not rely on the task's state because it may have been > > set to `TASK_LOST` for backwards-compatibility. > > bool unreachable = > > unreachableTime.isSome() && !protobuf::isTerminalState(task->state()); > > ``` > > > > Inlining `unreachable` is probably fine too but separating it helps > > clarify it with comments? > > James Peach wrote: > You have to sample the task state before the update because it might > transition to a terminal state at that time (and LOST is terminal). Right, so put the line above `updateTask`? The main thing I am commenting on is that we can consolidate the logic without first defining `bool unreachable = true;` much far above? > On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote: > > src/tests/master_tests.cpp > > Lines 7637 (patched) > > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7637> > > > > You could just wait for the the status update (being droppped) instead > > of capaturing `Slave::executorTerminated` and risk the race condition of > > the master not receiving the status update before the agent is deemed > > partitioned? > > James Peach wrote: > Dealing with all the status updates was less reliable. Do you see the race condition I was trying to point out? > On Jan. 4, 2018, 2:09 p.m., Jiang Yan Xu wrote: > > src/tests/master_tests.cpp > > Lines 7686 (patched) > > <https://reviews.apache.org/r/64940/diff/1/?file=1930131#file1930131line7686> > > > > No need to settle? > > James Peach wrote: > This is settling to ensure that events following from marking the agent > down are fully processed. Oh ok that's right. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64940/#review194777 --- On Jan. 5, 2018, 11:06 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64940/ > --- > > (Updated Jan. 5, 2018, 11:06 a.m.) > > > Review request for mesos, Benjamin Mahler, Gaston Kleiman, Jie Yu, Vinod > Kone, and Jiang Yan Xu. > > > Bugs: MESOS-8337 > https://issues.apache.org/jira/browse/MESOS-8337 > > > Repository: mesos > > > Description > --- > > If an agent is lost, we try to remove all the tasks that might > have been lost. However, if a task is already terminal, it hasn't > really been lost so we should not be tracking it in the framework's > unreachable tasks list. > > > Diffs > - > > src/master/master.hpp 130f6e28cc62a8912aac66ecfbf014fe1ee444e3 > src/master/master.cpp 28d8be3a4769b418b61cff0b95845e4232135bc7 > src/tests/partition_tests.cpp 3813139f576ea01db0197f0fe8a73597db1bb69a > > > Diff: https://reviews.apache.org/r/64940/diff/4/ > > > Testing > --- > > make check (Fedora 27) > > > Thanks, > > James Peach > >
Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
> On Jan. 4, 2018, 5:25 p.m., Vinod Kone wrote: > > src/master/master.cpp > > Lines 10037-10056 (original), 10039-10062 (patched) > > <https://reviews.apache.org/r/64940/diff/1/?file=1930130#file1930130line10039> > > > > I think we shouldn't create a TASK_UNREACHABLE status update and call > > `updateTask` or `forward` for a terminal task at all. . Also, `forward` > > sends TASK_UNREACHABLE update for terminal task to the framework which > > looks incorrect. > > > > > > Ideally, we want terminal but unacknowledged tasks to still be marked > > unreachable in some way, either via task state being TASK_UNREACHABLE or > > task being present in `unreachableTasks`. This allows, for example, the > > WebUI to not show sandbox links for unreachable tasks irrespective of > > whether they were terminal or not before going unreachable. > > > > But doing this is tricky for various reasons: > > > > --> `updateTask()` doesn't allow a terminal state to be transitioned to > > TASK_UNREACHABLE. Right now when we call `updateTask` for a terminal task, > > it adds TASK_UNREACHABLE status to `Task.statuses` and also sends it to > > operator API stream subscribers which looks incorrect. The fact that > > `updateTask` internally deals with already terminal tasks is a bad design > > decision in retrospect. I think the callers shouldn't call it for terminal > > tasks instead. > > > > --> It's not clear to our users what a `completed` task means. The > > intention was for this to hold a cache of terminal and acknowledged tasks > > for storing recent history. The users of the WebUI probably equate > > "Completed Tasks" to terminal tasks irrespective of their acknowledgement > > status, which is why it is confusing for them to see terminal but > > unacknowledged tasks in the "Active tasks" section in the WebUI. > > > > --> When a framework reconciles the state of a task on an unreachable > > agent, master replies with TASK_UNREACHABLE irrespective of whether the > > task was in a non-terminal state or terminal but un-acknowledged state or > > terminal and acknowledged state when the agent went unreachable. > > > > I think the direction we want to go towards is > > > > --> Completed tasks should consist of terminal unacknowledged and > > terminal acknowled tasks, likely in two different data structures. > > --> Unreachable tasks should consist of all non-complete tasks on an > > unreachable agent. All the tasks in this map should be in TASK_UNREACHABLE > > state. > > > > > > Given all the above is a very involved change, I would recommend > > keeping what you have here but with a giant TODO (your current comment in > > #10058 doesn't go into enough detail about the complexity here) that talks > > about the above stuff. Your change at least keeps the parity with the > > (broken) semantics that we have in 1.4 and earlier so that's a bit better. > > Vinod Kone wrote: > Ignore the first line. Forgot to delete it. Future direction 1. If completed == terminal unacknowledged + terminal acknowledged, then completed == terminal right? Should we then unify the terminology and pick one? 2. Unreachable tasks == non-terminal tasks on an unreachable agent: this is what this RR is going to do but IIUC you want a different behavior. Current semantics 1. In 1.4 the the master (in `updateTask` sends `TASK_UNREACHABLE` to the operator API subsribers for terminal tasks), as it stands right now we are going to send `TASK_UNREACHABLE` to the schedulers as well. Should we change that? 2. You also said above that "Ideally, we want terminal but unacknowledged tasks to still be marked unreachable in some way" which seems to contradict your later point that "Unreachable tasks should consist of all non-complete (terminal) tasks", could you clarify? Overall it sounds to me that the most correct semantic is still to set `TASK_UNREACHABLE` only for the tasks that are non-terminal (because otherwise we know that the state is not going to change to something else that we don't know yet) but perhaps we can use another field in the status to signal the fact that the agent is partitioned? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64940/#review194794 --- On Jan. 3, 2018, 5:35 p.m., James Peach wrote: > > -
Re: Review Request 64940: Prevented a crash when an agent with terminal tasks is lost.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64940/#review194777 --- src/master/master.cpp Lines 10058-10060 (patched) <https://reviews.apache.org/r/64940/#comment273769> Perhaps we just need to comment on the logic we use to determine if the task is `unreachable` and not what `removeTask` is going to do about it (because it's hidden here)? i.e., remove the comment here and use something like what I suggested below? src/master/master.cpp Lines 10061 (patched) <https://reviews.apache.org/r/64940/#comment273760> Perhaps move all the logic around determining the unreachable task down here. ``` // We need to inform `removeTask` whether the task became unreachable due to the agent being removed. Note that terminal tasks are not considered unreachable. We can not rely on the task's state because it may have been set to `TASK_LOST` for backwards-compatibility. bool unreachable = unreachableTime.isSome() && !protobuf::isTerminalState(task->state()); ``` Inlining `unreachable` is probably fine too but separating it helps clarify it with comments? src/master/master.cpp Line 10273 (original), 10279-10280 (patched) <https://reviews.apache.org/r/64940/#comment273766> So we had to use a boolean because we cannot rely on the task *state*, so "any of the unreachable states" is a bit misleading? I think the CHECK message is already clear enough for now? (Like you did in `removeTask`) I suggest we don't comment further on the removable-ness of the task because we might want to change it: https://issues.apache.org/jira/browse/MESOS-8389 src/tests/master_tests.cpp Lines 7601-7602 (patched) <https://reviews.apache.org/r/64940/#comment273805> This reads like "terminal tasks" are stored as "either finished or lost" when it's only the non-terminal task that is transitioned to lost. Can we just say that we are verifying "the terminal task is stored in its original state when parition happens"? src/tests/master_tests.cpp Lines 7603 (patched) <https://reviews.apache.org/r/64940/#comment273804> Most agent partition related tests are put in partition_tests.cpp, can we put this one there as well for better grouping? In particular there is a test named `TaskCompletedOnPartitionedAgent` there which is somewhat related. We can name this one `AgentWithCompletedTaskPartitioned` or `TaskCompletedBeforeAgentPartitioned` The main word I want to change is `Lost`. While I can understand you are using it interchangably with *partitioned*, in this context when it's call partitioned elsewhere and there's `TaskLost` to confuse with it, perhaps it's better to use the more precise word. src/tests/master_tests.cpp Lines 7621 (patched) <https://reviews.apache.org/r/64940/#comment273811> We typically name this `offers` since it's a vector. You'll be using the first and only one so later you should ``` AWAIT_READY(offers); ASSERT_FALSE(offers->empty()); const Offer& offer = offers->at(0); ``` src/tests/master_tests.cpp Lines 7637 (patched) <https://reviews.apache.org/r/64940/#comment273812> You could just wait for the the status update (being droppped) instead of capaturing `Slave::executorTerminated` and risk the race condition of the master not receiving the status update before the agent is deemed partitioned? src/tests/master_tests.cpp Lines 7641 (patched) <https://reviews.apache.org/r/64940/#comment273810> Instead of settling it's more direct to just wait for the offer before using it (like I mentioned above). src/tests/master_tests.cpp Lines 7675 (patched) <https://reviews.apache.org/r/64940/#comment273783> "fast fast" to emphasize? :) src/tests/master_tests.cpp Lines 7681 (patched) <https://reviews.apache.org/r/64940/#comment273784> If you wait for the ping, you don't need to settle? src/tests/master_tests.cpp Lines 7686 (patched) <https://reviews.apache.org/r/64940/#comment273785> No need to settle? - Jiang Yan Xu On Jan. 3, 2018, 5:35 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64940/ > ----------- > > (Updated Jan. 3, 2018, 5:35 p.m.) > > > Review request for mesos, Gaston Kleiman, Jie Yu, and Jiang Yan Xu. > > > Bugs: MESOS-8337 > https://issues.apache.org/jira/browse/MESOS-8337 > > > Repository: mesos > > > Description > --- > > If an agent is lost
Re: Review Request 64939: Abort libprocess when a Process throws an uncaught exception.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64939/#review194774 --- Ship it! Ship It! - Jiang Yan Xu On Jan. 3, 2018, 5:30 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64939/ > --- > > (Updated Jan. 3, 2018, 5:30 p.m.) > > > Review request for mesos, Alexander Rukletsov, Benjamin Hindman, Ilya Pronin, > Jie Yu, and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > Previously, we logged the exception, terminated the throwing > Process, and continued to run. However, currently there exists > no known user-level code that I'm aware of that handles the > unexpected termination due to an uncaught exception. > > Generally, this means that when an exception is thrown (e.g. > a bad call to std::map::at), the process terminates with a log > message but things get "stuck" and the user has to debug what > is wrong / kill the process. > > Libprocess would likely need to provide some primitives to > better support handling unexpected termination of a Process > in order for us to provide a strategy where we continue running. > > > Diffs > - > > 3rdparty/libprocess/src/process.cpp > 75cf1d3b6d3d257ba9bc81c68017a74a6511cebf > > > Diff: https://reviews.apache.org/r/64939/diff/1/ > > > Testing > --- > > make check > > > Thanks, > > Benjamin Mahler > >
Review Request 64933: Fixed flaky PartitionedSlaveReregistrationMasterFailover test.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64933/ --- Review request for mesos, Alexander Rukletsov and Megha Sharma. Bugs: MESOS-8334 https://issues.apache.org/jira/browse/MESOS-8334 Repository: mesos Description --- The test was flaky due to a race condition that causes status updates to be dropped because the agent reregisters before the schedulers. Diffs - src/tests/partition_tests.cpp 7f4b9ed938fd478fdf750b464c531bf76de7d798 Diff: https://reviews.apache.org/r/64933/diff/1/ Testing --- make check. Ran the affected test for 1000 iterations. Thanks, Jiang Yan Xu
Re: Review Request 64898: Removed duplicated code that tests for removable tasks.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64898/#review194706 --- src/master/master.cpp Lines 10171 (patched) <https://reviews.apache.org/r/64898/#comment273641> Would it be simpler to just use the new variable without a lambda? ``` TaskState newState = latestState.getOrElse(status.state()); ``` - Jiang Yan Xu On Jan. 2, 2018, 4:26 p.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64898/ > --- > > (Updated Jan. 2, 2018, 4:26 p.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > When we get a task update and check whether the latest state > indicated that the task is removable, invoke a local lambda > function so that we don't duplicate the same code for both > the current and latest states. > > > Diffs > - > > src/master/master.cpp 03eb178fa1af7d55ae387e6cb42cdc8d721a2196 > > > Diff: https://reviews.apache.org/r/64898/diff/2/ > > > Testing > --- > > make check > > > Thanks, > > James Peach > >
Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.
> On Dec. 13, 2017, 11:32 a.m., James Peach wrote: > > src/master/master.cpp > > Line 3876 (original), 3881 (patched) > > <https://reviews.apache.org/r/64515/diff/1/?file=1912811#file1912811line3881> > > > > You can now log the `SlaveID` here. The `SlaveInfo` here may not have a `SlaveID` if it's called from `registerSlave`. Here to consistently identity the agent, we would need to pass in the pid of the agent, (`SlaveInfo` also has the hostname). This info is already logged once the authorization is done. Other `Master::authorize*` methods all log the data passed to the authorizer here, I'll add the agent's resources for consistency. > On Dec. 13, 2017, 11:32 a.m., James Peach wrote: > > src/master/master.cpp > > Lines 3905 (patched) > > <https://reviews.apache.org/r/64515/diff/1/?file=1912811#file1912811line3905> > > > > So what are the semantics for dynamic resources? If this new ACL > > semantics is expressing that resources in role `Foo` may only be provided > > on agents which principal `Bar`, it doesn't seem correct that we would > > allow dynamic reservations to violate that. The semantics of `reserve_resources` ACL is still the same: it determines whether a principal can reserve resources of certain roles, regardless of whether it's static or dynamic. What's new is the usage that considers the static reservation to be initiated by the agent's principal and we are authorizing it here. I think in reality we use ACLs to prevent "bad actors" to do harmful things so in the case of dynamic reservations we don't need to addtinoally check against the agent's principal because the agent is not initiating the reservation. In any case, it shouldn't be authorized at agent registration time but reservation request time. I adjusted the comments. Do you think it's clearer? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64515/#review193690 --- On Jan. 3, 2018, 11 a.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64515/ > --- > > (Updated Jan. 3, 2018, 11 a.m.) > > > Review request for mesos, Alexander Rojas and James Peach. > > > Bugs: MESOS-8306 > https://issues.apache.org/jira/browse/MESOS-8306 > > > Repository: mesos > > > Description > --- > > Used `reserve_resources` ACL for static reservations. > > > Diffs > - > > include/mesos/authorizer/acls.proto > d869820491f1b81f81e1157493aa73e32735b9c3 > src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 > src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd > src/tests/master_authorization_tests.cpp > 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 > > > Diff: https://reviews.apache.org/r/64515/diff/3/ > > > Testing > --- > > make check. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64515/ --- (Updated Jan. 3, 2018, 11 a.m.) Review request for mesos, Alexander Rojas and James Peach. Changes --- Addressed comments. Bugs: MESOS-8306 https://issues.apache.org/jira/browse/MESOS-8306 Repository: mesos Description --- Used `reserve_resources` ACL for static reservations. Diffs (updated) - include/mesos/authorizer/acls.proto d869820491f1b81f81e1157493aa73e32735b9c3 src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd src/tests/master_authorization_tests.cpp 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 Diff: https://reviews.apache.org/r/64515/diff/3/ Changes: https://reviews.apache.org/r/64515/diff/2-3/ Testing --- make check. Thanks, Jiang Yan Xu
Re: Review Request 64515: Used `reserve_resources` ACL for static reservations.
> On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote: > > src/master/master.cpp > > Lines 6138-6141 (original), 6169-6173 (patched) > > <https://reviews.apache.org/r/64515/diff/1/?file=1912811#file1912811line6169> > > > > How about: > > > > _Not authorized to register agent with(out) principal XXX providing the > > resrouces _ It now reads "Not authorized to register agent providing resources with principal " or "Not authorized to register agent providing resources with without a principal" I am putting "agent" together with "providing resources" because they together are the objects. > On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote: > > src/tests/master_authorization_tests.cpp > > Lines 2538-2539 (patched) > > <https://reviews.apache.org/r/64515/diff/1/?file=1912812#file1912812line2538> > > > > Instead of the things in the parenthesis you can add: > > > > ```c++ > > { > > mesos::ACL::ReserveResources* acl = acls.add_reserve_resources(); > > acl->mutable_principals()->add_values(DEFAULT_CREDENTIAL.principal()); > > acl->mutable_roles()->set_type(ACL::Entity::NONE); > > } > > ``` > > > > And add a comment when starting the agent indicating that the agent is > > registering with `DEFAULT_CREDENTIAL.principal()` The test above `UnauthorizedToStaticallyReserveResources` has exactly this. I am just using different scenarios to broaden the test coverage. > On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote: > > src/tests/master_authorization_tests.cpp > > Lines 2571 (patched) > > <https://reviews.apache.org/r/64515/diff/1/?file=1912812#file1912812line2571> > > > > Please add a test where the agent tries to register with > > `high-security-role` and succeeds. Added test `AuthorizedToStaticallyReserveRole`. > On Dec. 13, 2017, 3:54 a.m., Alexander Rojas wrote: > > src/tests/master_authorization_tests.cpp > > Lines 2577 (patched) > > <https://reviews.apache.org/r/64515/diff/1/?file=1912812#file1912812line2577> > > > > this test can be merged with > > `MasterAuthorizationTest.UnauthorizedToStaticallyReserveResources` and you > > save a restart of a master. You mean because the ACLs are set up the same way, we can have one test with two scenarios, agent1 unable to register for reason1, agent2 able to register for reason2? I think that's OK but in this case it's easier to understand the two cases individually with smaller and independent tests? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64515/#review193670 --- On Jan. 3, 2018, 11 a.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64515/ > --- > > (Updated Jan. 3, 2018, 11 a.m.) > > > Review request for mesos, Alexander Rojas and James Peach. > > > Bugs: MESOS-8306 > https://issues.apache.org/jira/browse/MESOS-8306 > > > Repository: mesos > > > Description > --- > > Used `reserve_resources` ACL for static reservations. > > > Diffs > - > > include/mesos/authorizer/acls.proto > d869820491f1b81f81e1157493aa73e32735b9c3 > src/master/master.hpp 8fe9420dbe03ea2cefc6a40b0f64284aa9fe7915 > src/master/master.cpp 282fdf8ac38e3613c621c1c8e5c50f27bde9dafd > src/tests/master_authorization_tests.cpp > 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 > > > Diff: https://reviews.apache.org/r/64515/diff/3/ > > > Testing > --- > > make check. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 64514: Refactor out `authorizeReserveResources` that takes a `Resources`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64514/ --- (Updated Dec. 13, 2017, 11:53 a.m.) Review request for mesos, Alexander Rojas and James Peach. Bugs: MESOS-8306 https://issues.apache.org/jira/browse/MESOS-8306 Repository: mesos Description (updated) --- This allows authorization of static resource reservations. These reservations do not arrived to the master through an `Offer::Operation::Reserve` API call but during agent registration. Diffs - src/master/master.hpp 232cc3758f240db626c4fdaf852163fa48af4dd7 src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea Diff: https://reviews.apache.org/r/64514/diff/1/ Testing --- make check. Thanks, Jiang Yan Xu
Re: Review Request 64098: Send status updates when agent re-registers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64098/#review193557 --- Ship it! Ship It! - Jiang Yan Xu On Dec. 12, 2017, 4:54 a.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > --- > > (Updated Dec. 12, 2017, 4:54 a.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > which has been previously removed by the master for being unreachable > or is unknown to the master due to the garbage collection of > the unreachable and gone agents in the registry re-registers. > > > Diffs > - > > src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea > src/tests/master_allocator_tests.cpp > 10de6f0104a6a93ff5037d7e2ab5cf21d57fbec8 > src/tests/master_tests.cpp 7b087675200fe0ca69319b715997700a91d9f6b2 > src/tests/partition_tests.cpp 54ccf783682e7e6db0847b9a6313489b7f8181f8 > src/tests/persistent_volume_tests.cpp > f255382af957cfa66f5efaffcaf1082b83b35e58 > src/tests/slave_recovery_tests.cpp 253b0fc2ff7ec1f00937d42636151553c46d5175 > src/tests/upgrade_tests.cpp 0efaa586153564b20f7884023946a11635425ee4 > > > Diff: https://reviews.apache.org/r/64098/diff/15/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Review Request 64516: Improved documentation on resource reservations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64516/ --- Review request for mesos, Alexander Rojas, James Peach, and Michael Park. Bugs: MESOS-8306 https://issues.apache.org/jira/browse/MESOS-8306 Repository: mesos Description --- Added explanation for authorizations on static reservations. Diffs - docs/reservation.md fa1f309f2b770e37964af2d5599519875195dbec Diff: https://reviews.apache.org/r/64516/diff/1/ Testing --- Used a markdown viewer. Thanks, Jiang Yan Xu
Review Request 64514: Refactor out `authorizeReserveResources` that takes a `Resources`.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64514/ --- Review request for mesos, Alexander Rojas and James Peach. Bugs: MESOS-8306 https://issues.apache.org/jira/browse/MESOS-8306 Repository: mesos Description --- This allows us to authorize static resource reservations that don't come from `Offer::Operation::Reserve`. Diffs - src/master/master.hpp 232cc3758f240db626c4fdaf852163fa48af4dd7 src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea Diff: https://reviews.apache.org/r/64514/diff/1/ Testing --- make check. Thanks, Jiang Yan Xu
Review Request 64515: Used `reserve_resources` ACL for static reservations.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64515/ --- Review request for mesos, Alexander Rojas and James Peach. Bugs: MESOS-8306 https://issues.apache.org/jira/browse/MESOS-8306 Repository: mesos Description --- Used `reserve_resources` ACL for static reservations. Diffs - include/mesos/authorizer/acls.proto 40a1425ca51c5bb70f7af2e17d605f2125dcb4cb src/master/master.hpp 232cc3758f240db626c4fdaf852163fa48af4dd7 src/master/master.cpp b10d0341276090bfa70aaa4fd6317a560e3334ea src/tests/master_authorization_tests.cpp 676543a5ad1bb5d47011fc2a8b05dfaaeef18c64 Diff: https://reviews.apache.org/r/64515/diff/1/ Testing --- make check. Thanks, Jiang Yan Xu
Re: Review Request 64098: Send status updates when agent re-registers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64098/#review193460 --- LGTM sans Ilya's comment about only use `replicated_log` flag where necessary. src/master/master.cpp Line 6807 (original), 6803-6804 (patched) <https://reviews.apache.org/r/64098/#comment272010> Move `Framework* framework = getFramework(frameworkId);` down into the if condition since it's used only in there? src/tests/partition_tests.cpp Lines 402-415 (original), 403-407 (patched) <https://reviews.apache.org/r/64098/#comment272016> Expectations above `detector.appoint(master.get()->pid);` everywhere as suggested in the last review. src/tests/partition_tests.cpp Lines 716-730 (original), 710-714 (patched) <https://reviews.apache.org/r/64098/#comment272017> Ditto. src/tests/partition_tests.cpp Lines 1030-1032 (original), 1015-1025 (patched) <https://reviews.apache.org/r/64098/#comment272018> Ditto. src/tests/partition_tests.cpp Lines 2297 (patched) <https://reviews.apache.org/r/64098/#comment272024> `statusFromAgent` sounds like it's actually sent by the agent. Maybe `finsihedStatusFromMaster` for this update and rename the other `finshedStatus` to `finsihedStatusFromAgent` makes it a bit easier to distinguish? (I do understand that you meant that the status is from the agent's reregister message but I think `finsihedStatusFromMaster` vs `finsihedStatusFromAgent` is clearer). Also, can we add a comment too? e.g., the first status is sent as part of the reregistation and the second is by the agent's status update manager after it is reregsitered. - Jiang Yan Xu On Dec. 8, 2017, 10:42 a.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > --- > > (Updated Dec. 8, 2017, 10:42 a.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > which has been previously removed by the master for being unreachable > or is unknown to the master due to the garbage collection of > the unreachable and gone agents in the registry re-registers. > > > Diffs > - > > src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e > src/tests/master_allocator_tests.cpp > 3400d70bb0ba564eac43c4639eee0efd4d8059e6 > src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 > src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d > src/tests/persistent_volume_tests.cpp > 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 > src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 > src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c > > > Diff: https://reviews.apache.org/r/64098/diff/12/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 64429: Fixed skipping of completed frameworks in the master failover benchmark.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64429/#review193192 --- Ship it! Ship It! - Jiang Yan Xu On Dec. 7, 2017, 2:33 p.m., Benjamin Mahler wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64429/ > --- > > (Updated Dec. 7, 2017, 2:33 p.m.) > > > Review request for mesos and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > This benchmark was previously accidentally skipping the completed > frameworks, due to an incorrect loop. > > > Diffs > - > > src/tests/master_benchmarks.cpp 6c3a8e514a452f68a0b420b98f5af0dccb01ab68 > > > Diff: https://reviews.apache.org/r/64429/diff/1/ > > > Testing > --- > > Ran the benchmark > > > Thanks, > > Benjamin Mahler > >
Re: Review Request 64098: Send status updates when agent re-registers.
Setting `WillRepeatedly` here is probably racy due to ``` EXPECT_CALL(sched, statusUpdate(, _)) .WillOnce(FutureArg<1>()); ``` below, not to mention the case could be interesting to test. We can set the expectation explicitly: The first update will come from the master, the reason being "agent_reregistered". The second update comes from the agent. - Jiang Yan Xu On Dec. 1, 2017, 4:29 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > --- > > (Updated Dec. 1, 2017, 4:29 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > which has been previously removed by the master for being unreachable > or is unknown to the master due to the garbage collection of > the unreachable and gone agents in the registry re-registers. > > > Diffs > - > > src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e > src/tests/master_allocator_tests.cpp > 3400d70bb0ba564eac43c4639eee0efd4d8059e6 > src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 > src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d > src/tests/persistent_volume_tests.cpp > 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 > src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 > src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c > > > Diff: https://reviews.apache.org/r/64098/diff/11/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 64098: Send status updates when agent re-registers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64098/#review192733 --- Mainly looked at the reason for the test changes. Will take another look at the whole patch once it's fixed. src/tests/master_allocator_tests.cpp Lines 1452-1454 (original), 1452-1459 (patched) <https://reviews.apache.org/r/64098/#comment270989> So these tests had to be changed becauase the master is not using the `replicated_log` registry flag so it cannot remember the agent... I think the right change is to use `replicated_log` for the tests that don't make sense to change for this patch. Note that the use of replicated log registrar breaks on Windows (see MESOS-5932) so we need to make sure such tests are disabled on Windows. The way to do it is to check whether the file is in this list: https://github.com/apache/mesos/blob/42952093c081e494d36721bc6c56ab73069a82e4/src/tests/CMakeLists.txt#L152 If it is, then it is already not built on Windows. If not, we can use `TEST_F_TEMP_DISABLED_ON_WINDOWS` as part of the test name to disable it. Please review all the tests in this RR for this. - Jiang Yan Xu On Dec. 1, 2017, 4:29 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > --- > > (Updated Dec. 1, 2017, 4:29 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > which has been previously removed by the master for being unreachable > or is unknown to the master due to the garbage collection of > the unreachable and gone agents in the registry re-registers. > > > Diffs > - > > src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e > src/tests/master_allocator_tests.cpp > 3400d70bb0ba564eac43c4639eee0efd4d8059e6 > src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 > src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d > src/tests/persistent_volume_tests.cpp > 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 > src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 > src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c > > > Diff: https://reviews.apache.org/r/64098/diff/8/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.
> On Nov. 21, 2017, 10:27 a.m., Alexander Rukletsov wrote: > > src/master/allocator/mesos/hierarchical.cpp > > Lines 430-437 (original), 431-438 (patched) > > <https://reviews.apache.org/r/63831/diff/1/?file=1894169#file1894169line431> > > > > Do we need a lambda here? > > Jiang Yan Xu wrote: > I kept the original use of lambdas. > > I think the benefit of labmdas here is more about readability: compared > to > > ``` > set removedRoles = oldRoles; > foreach (const string& role, newRoles) { > result.erase(role); > } > ``` > > The constness and construction of the variable is clearly isolated into a > small block in a long method with many similar variables/code blocks. > > However as pointed out by the TODO, we should probably just implement a > set difference operator so all these become one-liners. I'll do it later. > > Alexander Rukletsov wrote: > Sounds like a wrong tool for the job and hence confusing. I would have > used an explicit scope then. But maybe this is just me. I'll ask MPark for > the reasoning here. Since it has been here before, it's fine to leave it for > now. Regading introducing the set operator, FYI: > https://reviews.apache.org/r/57110/#comment239761 Oh ok, thanks for the pointer! - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63831/#review191613 --- On Nov. 27, 2017, 4:45 p.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63831/ > --- > > (Updated Nov. 27, 2017, 4:45 p.m.) > > > Review request for mesos and Alexander Rukletsov. > > > Bugs: MESOS-8223 > https://issues.apache.org/jira/browse/MESOS-8223 > > > Repository: mesos > > > Description > --- > > In `updateFramework`, we currently treat frameworks suppressing a role the > same way as frameworks moving off a role and this could lead to the framework > being removed from the framework sorter, which is unexpected. We should only > remove the framework from a framework sorter when it's moving away from the > corresponding role. > > > Diffs > - > > src/master/allocator/mesos/hierarchical.cpp > 5ce9ceaa3a5f84a1e076d45448863c418531cc2b > src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 > > > Diff: https://reviews.apache.org/r/63831/diff/2/ > > > Testing > --- > > make check. > > Added a test. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.
> On Nov. 21, 2017, 9:53 a.m., Alexander Rukletsov wrote: > > src/tests/scheduler_tests.cpp > > Lines 1515-1516 (original), 1534-1535 (patched) > > <https://reviews.apache.org/r/63830/diff/1/?file=1892884#file1892884line1542> > > > > `.WillRepeatedly(Return());` > > Jiang Yan Xu wrote: > There's only one agent with no tasks launched so I don't expect > additional offers? > > Alexander Rukletsov wrote: > I see it in a different way. The default behaviour is ignore subsequent > offers and only if you need to verify that in some particular case no more > offers will be made, you should omit `.WillRepeatedly(Return());`. In this > case what you want to check is that "scheduler starts getting offers" saying > nothing about its quantity or periodicity. Oh ok, but if I am intentionally testing no more offers will be made I'll probably need to settle the clock etc to prevent race conditions (and comment as such). Are you suggesting that when writing tests we should by default always add `.WillRepeatedly(Return());`? I am not sure if I have seen this being adopted and it seems to result in necessary code. So my approach has been that, if the test is pretty deterministic, write the test asserting the precise case. If there are certain unpredicable race condidtions (maybe not worth preventing because you don't care), then use these `WillRepeatedly // Ignore because don't care` calls. Reasonable? :) - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63830/#review191602 ----------- On Nov. 21, 2017, 9:46 a.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63830/ > --- > > (Updated Nov. 21, 2017, 9:46 a.m.) > > > Review request for mesos and Alexander Rukletsov. > > > Bugs: MESOS-7996 > https://issues.apache.org/jira/browse/MESOS-7996 > > > Repository: mesos > > > Description > --- > > The previous iteration of the test failed to reveal the bug in MESOS-8200 due > to a race condition. > > > Diffs > - > > src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 > > > Diff: https://reviews.apache.org/r/63830/diff/1/ > > > Testing > --- > > As expected, this test would reliably fail without /r/63741/ with the > following > > ``` > I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED > received from http://17.121.128.13:38423/master/api/v1/scheduler > I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT > received from http://17.121.128.13:38423/master/api/v1/scheduler > I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework > 560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default) > I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS > received from http://17.121.128.13:38423/master/api/v1/scheduler > ../../src/tests/scheduler_tests.cpp:1497: Failure > Mock function called more times than expected - returning directly. > Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object > <50-98 0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>) > Expected: to be never called >Actual: called once - over-saturated and active > ``` > > The test passes with /r/63741/. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 64250: Added new reasons task status update.
> On Dec. 1, 2017, 1:55 p.m., Ilya Pronin wrote: > > docs/task-state-reasons.md > > Lines 474-477 (patched) > > <https://reviews.apache.org/r/64250/diff/2/?file=1906002#file1906002line474> > > > > I don't quite follow this note. A modified copy of which update? Should > > we just say that these updates reflect the states of the tasks reported by > > the agent upon its re-registration? > > Megha Sharma wrote: > Here, we are saying that master takes the actual state reported by the > agent and enriches it with reason and message before sending to the > framework. But I am open to changing it if you or Yan feel it doesn't convery > the idea. I was suggesting it because this status is generated by the master the same way they are generated upon reconciliation. For `REASON_RECONCILIATION` in the same doc there is this note: ``` Note: Status updates with this reason are not the original ones, but rather a modified copy that is re-sent from the master. In particular, the original data and message fields are erased and the original reason field is overwritten by REASON_RECONCILIATION . ``` It is the same for `REASON_SLAVE_REREGISTERED`. Megha I just noticed that you didn't mention the erasure of `data` and `message` fields, but perhaps this is not worth repeating. I was originally suggesting refering to `REASON_RECONCILIATION` for details: something like: `Status updates with this reason are a modified copies re-sent by the master. See comments for REASON_RECONCILIATION.` Would this be clearer? "reflect the states of the tasks reported by the agent upon its re-registration" this sentence is good too. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64250/#review192567 --- On Dec. 1, 2017, 1:06 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64250/ > --- > > (Updated Dec. 1, 2017, 1:06 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > Added new reasons `REASON_SLAVE_REREGISTERED` and > `REASON_AGENT_REREGISTERED` for v0 and v1 task status update. > The new reason will be used when master starts to send status > update during the re-registration of an agent which was > previously removed by the master because of being unreachable > or is unknown to the master due to the garbage collection of > the unreachable and gone agents in the registry. > > > Diffs > - > > docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 > include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa > include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 > > > Diff: https://reviews.apache.org/r/64250/diff/4/ > > > Testing > --- > > code changes verified with make check and the documents changes with markdown > viewer > > > Thanks, > > Megha Sharma > >
Re: Review Request 64250: Added new reasons task status update.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64250/#review192563 --- Fix it, then Ship it! docs/task-state-reasons.md Lines 476 (patched) <https://reviews.apache.org/r/64250/#comment270801> What is the `/written` part about? `overwritten` covers all cases? docs/task-state-reasons.md Lines 477 (patched) <https://reviews.apache.org/r/64250/#comment270805> Given the discussion in /r/64098/, here perhaps add another `Note: Due to garbage collection of the unreachable and gone agents in the registry, Mesos also sends such status updates for agents unknown to the master`. docs/task-state-reasons.md Lines 478 (patched) <https://reviews.apache.org/r/64250/#comment270804> Should there be a ``? Also when it involves doc changes, could you in the testing done section mention that you have used a markdown viewer to review the doc change? - Jiang Yan Xu On Dec. 1, 2017, 1:06 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64250/ > --- > > (Updated Dec. 1, 2017, 1:06 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > Added new reasons `REASON_SLAVE_REREGISTERED` and > `REASON_AGENT_REREGISTERED` for v0 and v1 task status update. > The new reason will be used when master starts to send status > update during the re-registeration of an agent which was > previosuly removed by the master because of being unreachable. > > > Diffs > - > > docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 > include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa > include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 > > > Diff: https://reviews.apache.org/r/64250/diff/2/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 64098: Send status updates when agent re-registers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64098/#review192557 --- src/master/master.cpp Line 6808 (original), 6808 (patched) <https://reviews.apache.org/r/64098/#comment270793> When considering the comment by Ilya in MESOS-6406 (i.e., what if agents GCed from the unreachable or gone list reregster?), seems like we can do this: 1. Move down the line `slaves.recovered.erase(slaveInfo.id());` to after we process `recoveredTasks`. 2. Instead of checking `slaves.unreachable.contains(slaveInfo.id()` we could check `!slaves.recovered.contains(slaveInfo.id()` 3. Now we are sending status updates for two cases: reregistering unreachable agents or unknown agents (which could have been marked either unreachable or gone but we can't distiguish) - We can distinguish unreachable and unknown in the task status message. - We can probably log a warning about tasks from unknown agents. src/master/master.cpp Lines 6818 (patched) <https://reviews.apache.org/r/64098/#comment270794> s/REASON_AGENT_REREGISTERED/REASON_SLAVE_REREGISTERED/ - Jiang Yan Xu On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > --- > > (Updated Nov. 27, 2017, 4:55 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > which has been previously removed by the master for being unreachable > re-registers. > > > Diffs > - > > src/master/master.cpp dfe60ef670edcaefa0c1241df2e2870f650fcf9e > src/tests/master_allocator_tests.cpp > 3400d70bb0ba564eac43c4639eee0efd4d8059e6 > src/tests/master_tests.cpp 57eae320a7a398527cd3623c89bf67f319a8e955 > src/tests/partition_tests.cpp 31ebfe1655438eceae74d72a223df03a9dbd282d > src/tests/persistent_volume_tests.cpp > 4aa3c2e8b0f461cd78053707cff8bcb2e6f2b0d7 > src/tests/slave_recovery_tests.cpp f14c6ef69eb20a03454c8197df79b572a3c6d050 > src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c > > > Diff: https://reviews.apache.org/r/64098/diff/5/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 64250: Added new reasons task status update.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64250/#review192531 --- docs/task-state-reasons.md Lines 370-372 (patched) <https://reviews.apache.org/r/64250/#comment270738> Should it be in the "Reasons for Non-Terminal Status Updates" section and is there a reason to refer to TASK_LOST? docs/task-state-reasons.md Lines 479 (patched) <https://reviews.apache.org/r/64250/#comment270740> "The task was part of an accepted offer": this is irrevelent to either TASK_UNREACHABLE or the reason REASON_SLAVE_REREGISTERED. docs/task-state-reasons.md Lines 480-482 (patched) <https://reviews.apache.org/r/64250/#comment270741> We don't need to put too much details on TASK_UNREACHABLE since it's described elsewhere. docs/task-state-reasons.md Lines 483-485 (patched) <https://reviews.apache.org/r/64250/#comment270744> I think we can just focus on: - This is sent when an agent previously marked as unreachable re-registers. - Status updates with this reason are not the original ones, see the comments for `REASON_RECONCILIATION`. include/mesos/mesos.proto Lines 2406 (patched) <https://reviews.apache.org/r/64250/#comment270736> I think has convention used here is to order the reasons alphabetically. include/mesos/v1/mesos.proto Lines 2387 (patched) <https://reviews.apache.org/r/64250/#comment270737> Ditto. - Jiang Yan Xu On Dec. 1, 2017, 8:20 a.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64250/ > --- > > (Updated Dec. 1, 2017, 8:20 a.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Repository: mesos > > > Description > --- > > Added new reasons `REASON_SLAVE_REREGISTERED` and > `REASON_AGENT_REREGISTERED` for v0 and v1 task status update. > The new reason will be used when master starts to send status > update during the re-registeration of an agent which was > previosuly removed by the master because of being unreachable. > > > Diffs > - > > docs/task-state-reasons.md 07f7927e28fe5624a6c9b74de731483aaa729ca3 > include/mesos/mesos.proto 11089b7d6f48f001ae64e8b0c67a2732684aa2fa > include/mesos/v1/mesos.proto c496da927903f472c78d455b7fdf886d8fb692c4 > > > Diff: https://reviews.apache.org/r/64250/diff/1/ > > > Testing > --- > > > Thanks, > > Megha Sharma > >
Re: Review Request 64098: Send status updates when agent re-registers.
> On Nov. 28, 2017, 11:22 a.m., Jiang Yan Xu wrote: > > src/master/master.cpp > > Lines 6788 (patched) > > <https://reviews.apache.org/r/64098/diff/3/?file=1902267#file1902267line6788> > > > > I think this type of status updates could benefit from a distinct > > reason for to make it more strongly typed. > > > > Could you add > > > > ``` > > v1/mesos.proto: REASON_AGENT_REREGISTERED > > mesos.proto: REASON_SLAVE_REREGISTERED > > ``` > > > > and use it? > > > > Note that the tag numbers in the proto enum need to match. > > > > Also note that there are also documentation related changes for these > > fields: > > https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md > > > > It's fine to make it a separate patch preceding this one. > > Ilya Pronin wrote: > Won't the new reason break any frameworks? > > Jiang Yan Xu wrote: > How will it break? We add new reasons all the time. This is indeed a new > sitaution that Mesos master sends status updates, how do you think the > framework will assert the reason otherwise? > > Ilya Pronin wrote: > Sure, the new reason reflects what's going on much better. I was thinking > that a framework that uses the old libmesos will get a "default" status > update reason, which is `REASON_COMMAND_EXECUTOR_FAILED`, and may act > incorrectly. But it seems that Aurora will be fine with such change. > > Megha Sharma wrote: > Yeah it seems that the frameworks should be fine with the addition of > another reason, I guess that's how we have been adding new reasons all along. > The proto default behavior to interpret the unknown enum value as first one > defined, is of course a problem but I believe the frameworks must already > have some guards in place to avoid that. > > Megha Sharma wrote: > +1 to add the new reasons: > > v1/mesos.proto: REASON_AGENT_REREGISTERED > mesos.proto: REASON_SLAVE_REREGISTERED I see. Yeah the fact that there's not a sensible default value for the reason field is terrible. At least `REASON_COMMAND_EXECUTOR_FAILED` is deprecated: https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md ``` The reason `REASON_COMMAND_EXECUTOR_FAILED` is deprecated and will be removed in the future. It should not be referenced by newly written code. ``` Hopefully we get rid of it soon and replace it with a `REASON_UNKONWN_REASON`. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64098/#review192041 --- On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > --- > > (Updated Nov. 27, 2017, 4:55 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > re-registers. > > > Diffs > - > > src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f > src/tests/master_allocator_tests.cpp > 3400d70bb0ba564eac43c4639eee0efd4d8059e6 > src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b > src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 > src/tests/persistent_volume_tests.cpp > acfeac16884b00581a3523607ff26f44f6dca53a > src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a > src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c > > > Diff: https://reviews.apache.org/r/64098/diff/4/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 61473: Do not kill non partition aware tasks.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61473/#review192185 --- Ship it! Committing with some small tweaks below and for the commit description. Please resolve all issues from other reviewers with comments. include/mesos/mesos.proto Line 345 (original), 343 (patched) <https://reviews.apache.org/r/61473/#comment270233> Use a `NOTE: `. include/mesos/v1/mesos.proto Line 343 (original), 341 (patched) <https://reviews.apache.org/r/61473/#comment270234> Use a `NOTE: `. src/master/http.cpp Lines 336-338 (patched) <https://reviews.apache.org/r/61473/#comment270235> Put it above `if (!authorizeTask_->accept(*task, framework_->info)) {` line so the order is consistent with the block below. Also add a small comment about this check: ``` // There could be TASK_LOST tasks in this map. See comment for // `unreachableTasks`. ``` src/master/master.cpp Lines 9322-9323 (patched) <https://reviews.apache.org/r/61473/#comment270231> Tweak comment so it's less jagged. src/master/master.cpp Lines 9614-9616 (original), 9565-9568 (patched) <https://reviews.apache.org/r/61473/#comment270232> Reorder the comments and the CHECK. - Jiang Yan Xu On Nov. 28, 2017, 4:59 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61473/ > --- > > (Updated Nov. 28, 2017, 4:59 p.m.) > > > Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu. > > > Bugs: MESOS-7215 > https://issues.apache.org/jira/browse/MESOS-7215 > > > Repository: mesos > > > Description > --- > > Master will not kill the tasks for non-Partition aware frameworks > when an unreachable agent re-registers with the master. > Master used to send a ShutdownFrameworkMessages to the agent > to kill the tasks from non partition aware frameworks including > the ones that are still registered which was problematic because > the offer from this agent could still go to the same framework which > could then launch new tasks. The agent would then receive tasks > of the same framework and ignore them because it thinks the > framework is shutting down. The framework is not shutting down of > course, so from the master and the scheduler's perspective the task > is pending in STAGING forever until the next agent reregistration, > which could happen much later. This commit fixes the problem by > not shutting down the non-partition aware frameworks on such an > agent. > > > Diffs > - > > include/mesos/mesos.proto b1ebfe25301549397a48468a02882e971213d45c > include/mesos/v1/mesos.proto d535eb40b205fc176730937eed4ce84ea7a369af > src/master/http.cpp 9dcdcbeeea6135091db5aa21dd54bc14d84f33fc > src/master/master.hpp 1c6a86fb37dee7a2ff4d564f4641a42af6206bb2 > src/master/master.cpp 7bcdb743659435847db6cdea917afc497e641582 > src/tests/partition_tests.cpp 067529acc2b3a1d7f0713c602d5f680ea19b6de8 > > > Diff: https://reviews.apache.org/r/61473/diff/29/ > > > Testing > --- > > make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 61473: Do not kill non partition aware tasks.
> On Nov. 27, 2017, 4:25 p.m., Ilya Pronin wrote: > > src/master/http.cpp > > Lines 343-345 (patched) > > <https://reviews.apache.org/r/61473/diff/20/?file=1902099#file1902099line343> > > > > I may be ignorant of the discussion behind this, but since we don't > > treat these tasks as completed anymore, do we really need to maintain > > backwards compatibility here, in stats and metrics? This kind of hides the > > real state of things. > > Megha Sharma wrote: > Copying @xujyan to the discussion. It's true that this isn't perfect and my first intuition was to export the real state. We chatted with Vinod as well. Ultimately we think it's worse to export information inconsistently. More than metrics there are v0 http endpoints, the webUI and v1 operator APIs that expose the task's state (in addition to the scheduler API). It would be weird for the task's state to show up differently via different APIs. Secondly, the mechanism implemented by Mesos to handle the backwards compatibility w.r.t TASK_LOST vs. new states is to decide which to use when the task status is **created**, unlike how we handle old/new reservation formats and framework API devolve/evolve, for example. This means we lose the original state in upon creation and have no way to export it later. We had to use a `bool unreachable` to *remember* the original state for unreachable tasks but this cannot be generalized to all "new states" and I don't think it's worth treating unreachable tasks specically. Does it make sense? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61473/#review191674 --- On Nov. 28, 2017, 9:28 a.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61473/ > --- > > (Updated Nov. 28, 2017, 9:28 a.m.) > > > Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu. > > > Bugs: MESOS-7215 > https://issues.apache.org/jira/browse/MESOS-7215 > > > Repository: mesos > > > Description > --- > > Master will not kill the tasks for non-Partition aware frameworks > when an unreachable agent re-registers with the master. > Master used to send a ShutdownFrameworkMessages to the agent > to kill the tasks from non partition aware frameworks including > the ones that are still registered which was problematic because > the offer from this agent could still go to the same framework which > could then launch new tasks. The agent would then receive tasks > of the same framework and ignore them because it thinks the > framework is shutting down. The framework is not shutting down of > course, so from the master and the scheduler's perspective the task > is pending in STAGING forever until the next agent reregistration, > which could happen much later. This commit fixes the problem by > not shutting down the non-partition aware frameworks on such an > agent. > > > Diffs > - > > include/mesos/mesos.proto e194093e490741acc552fd3ad328fd710b4b4435 > include/mesos/v1/mesos.proto 6fb1139683952877667abbcf8bf84b5b31bcd29e > src/master/http.cpp 10084125deb839a9846a4f64d2e433ff02754c02 > src/master/master.hpp a309fc78ee2613762f3d5d22ac7559afc7aac4a3 > src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f > src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 > > > Diff: https://reviews.apache.org/r/61473/diff/25/ > > > Testing > --- > > make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 64098: Send status updates when agent re-registers.
> On Nov. 28, 2017, 11:22 a.m., Jiang Yan Xu wrote: > > src/master/master.cpp > > Lines 6788 (patched) > > <https://reviews.apache.org/r/64098/diff/3/?file=1902267#file1902267line6788> > > > > I think this type of status updates could benefit from a distinct > > reason for to make it more strongly typed. > > > > Could you add > > > > ``` > > v1/mesos.proto: REASON_AGENT_REREGISTERED > > mesos.proto: REASON_SLAVE_REREGISTERED > > ``` > > > > and use it? > > > > Note that the tag numbers in the proto enum need to match. > > > > Also note that there are also documentation related changes for these > > fields: > > https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md > > > > It's fine to make it a separate patch preceding this one. > > Ilya Pronin wrote: > Won't the new reason break any frameworks? How will it break? We add new reasons all the time. This is indeed a new sitaution that Mesos master sends status updates, how do you think the framework will assert the reason otherwise? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64098/#review192041 --- On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > ------- > > (Updated Nov. 27, 2017, 4:55 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > re-registers. > > > Diffs > - > > src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f > src/tests/master_allocator_tests.cpp > 3400d70bb0ba564eac43c4639eee0efd4d8059e6 > src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b > src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 > src/tests/persistent_volume_tests.cpp > acfeac16884b00581a3523607ff26f44f6dca53a > src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a > src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c > > > Diff: https://reviews.apache.org/r/64098/diff/4/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 64098: Send status updates when agent re-registers.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/64098/#review192041 --- Reviewed the non-testing portion. Will continue to review the tests. src/master/master.cpp Lines 6783 (patched) <https://reviews.apache.org/r/64098/#comment270036> 4 space indentation for function arguments. src/master/master.cpp Lines 6788 (patched) <https://reviews.apache.org/r/64098/#comment270046> I think this type of status updates could benefit from a distinct reason for to make it more strongly typed. Could you add ``` v1/mesos.proto: REASON_AGENT_REREGISTERED mesos.proto: REASON_SLAVE_REREGISTERED ``` and use it? Note that the tag numbers in the proto enum need to match. Also note that there are also documentation related changes for these fields: https://github.com/apache/mesos/blob/master/docs/task-state-reasons.md It's fine to make it a separate patch preceding this one. src/master/master.cpp Lines 6789 (patched) <https://reviews.apache.org/r/64098/#comment270044> s/Agent/agent/ src/master/master.cpp Lines 6791-6792 (patched) <https://reviews.apache.org/r/64098/#comment270043> I know that the styling of the ternary operator is terribly inconsistent but could you do this which I think it more widely used? ``` (task.has_executor_id() ? Option(task.executor_id()) : None()), ``` There will be a day when we just use clang tidy to fix everything. :) src/master/master.cpp Lines 6793-6797 (patched) <https://reviews.apache.org/r/64098/#comment270041> Should we actually assign these values? ``` protobuf::getTaskHealth(*task), protobuf::getTaskCheckStatus(*task), None(), protobuf::getTaskContainerStatus(*task) ``` Note that the `None`s are default values so you don't have to specify trailing Nones. - Jiang Yan Xu On Nov. 27, 2017, 4:55 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/64098/ > --- > > (Updated Nov. 27, 2017, 4:55 p.m.) > > > Review request for mesos, Ilya Pronin, James Peach, and Jiang Yan Xu. > > > Bugs: MESOS-6406 > https://issues.apache.org/jira/browse/MESOS-6406 > > > Repository: mesos > > > Description > --- > > Master will send task status updates to frameworks when an agent > re-registers. > > > Diffs > - > > src/master/master.cpp 2ddd67ada3731803b00883b6a1f32b20c1bb238f > src/tests/master_allocator_tests.cpp > 3400d70bb0ba564eac43c4639eee0efd4d8059e6 > src/tests/master_tests.cpp 9c450b9f592d9e09a468f537d9b500e97acc636b > src/tests/partition_tests.cpp e49c474167076b4136a161ed29b11db9a13455a7 > src/tests/persistent_volume_tests.cpp > acfeac16884b00581a3523607ff26f44f6dca53a > src/tests/slave_recovery_tests.cpp c864aa92d9ff128a89dbc25653385de25653f56a > src/tests/upgrade_tests.cpp 7f434dbba858f636719eec24e92b306b76430c4c > > > Diff: https://reviews.apache.org/r/64098/diff/3/ > > > Testing > --- > > with make check > > > Thanks, > > Megha Sharma > >
Re: Review Request 61473: Do not kill non partition aware tasks.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/61473/#review191976 --- Almost there! include/mesos/mesos.proto Lines 344-346 (original), 342 (patched) <https://reviews.apache.org/r/61473/#comment269950> Given the potential confusion about the chagned behavior, at the API level I think there's value in explaining the history a little bit: ``` Prior to Mesos 1.5, such tasks will be killed by Mesos when the agent reregisters (unless the master has failed over). However due to the lack of benefit in maintaining different behaviors depending on whether the master has failed over (see MESOS-7215), as of 1.5, Mesos will not kill these tasks in either case. ``` How does it soud? include/mesos/v1/mesos.proto Lines 342-344 (original), 340 (patched) <https://reviews.apache.org/r/61473/#comment269957> Ditto. src/master/http.cpp Line 322 (original), 323 (patched) <https://reviews.apache.org/r/61473/#comment269983> Do we need a ``` if (task.get()->state() != TASK_UNREACHABLE) { continue; } ``` here? src/master/master.hpp Line 873 (original), 873 (patched) <https://reviews.apache.org/r/61473/#comment269994> We need the comment about `unreachable` here as well? src/master/master.hpp Line 2644 (original), 2645 (patched) <https://reviews.apache.org/r/61473/#comment269995> Are there two cases here or `unreachable` is the only variable that matters? We can assert the following invariant inside an `if (unreachable) {}`: ``` CHECK(task->state() == TASK_UNREACHABLE || task->state() == TASK_LOST) << task->state(); ``` src/master/master.hpp Line 3160 (original), 3157 (patched) <https://reviews.apache.org/r/61473/#comment269996> s/all tasks/tasks runing/ src/master/master.cpp Lines 6804-6807 (original), 6779-6782 (patched) <https://reviews.apache.org/r/61473/#comment269997> Fix indentation. src/master/master.cpp Line 6849 (original), 6820 (patched) <https://reviews.apache.org/r/61473/#comment269998> s/Determine which frameworks on the slave to shutdown, if any.// Previously when there are two cases, this sentence is a summary. Now that there is only one case, this sentence feels redudant. src/master/master.cpp Line 9370 (original), 9322 (patched) <https://reviews.apache.org/r/61473/#comment27> I have the similar feeling as Ilya, without context it's hard to understand this line. Although the context is not too far above, why not just define it within the context? ``` TaskState newTaskState = TASK_UNREACHABLE; TaskStatus::Reason newTaskReason = TaskStatus::REASON_SLAVE_REMOVED; // Needed to convey task unreachablility because we lose this information // from the task state if `TASK_LOST` is used. bool unreachable = true; if (!framework->capabilities.partitionAware) { newTaskState = TASK_LOST; } else if (unreachableTime.isNone()) { unreachable = false; newTaskState = TASK_GONE_BY_OPERATOR; newTaskReason = TaskStatus::REASON_SLAVE_REMOVED_BY_OPERATOR; } ``` Note that I simplied the nested if else structure. Does it look right? src/master/master.cpp Line 9582 (original), 9535 (patched) <https://reviews.apache.org/r/61473/#comment270001> Assert this invariant inside this `if`? ``` CHECK(!unreachable) << task->task_id(); ``` src/master/master.cpp Line 10318 (original), 10271 (patched) <https://reviews.apache.org/r/61473/#comment270002> Use `foreachvalue`. `values()` call will copy out the values. - Jiang Yan Xu On Nov. 27, 2017, 4:56 p.m., Megha Sharma wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/61473/ > --- > > (Updated Nov. 27, 2017, 4:56 p.m.) > > > Review request for mesos, James Peach, Vinod Kone, and Jiang Yan Xu. > > > Bugs: MESOS-7215 > https://issues.apache.org/jira/browse/MESOS-7215 > > > Repository: mesos > > > Description > --- > > Master will not kill the tasks for non-Partition aware frameworks > when an unreachable agent re-registers with the master. > Master used to send a ShutdownFrameworkMessages to the agent > to kill the tasks from non partition aware frameworks including > the ones that are still registered which was problematic because > the offer from this agent could still go to the same framework which > could then launch new t
Re: Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63831/ --- (Updated Nov. 27, 2017, 4:45 p.m.) Review request for mesos and Alexander Rukletsov. Changes --- Addressed comments. Bugs: MESOS-8223 https://issues.apache.org/jira/browse/MESOS-8223 Repository: mesos Description --- In `updateFramework`, we currently treat frameworks suppressing a role the same way as frameworks moving off a role and this could lead to the framework being removed from the framework sorter, which is unexpected. We should only remove the framework from a framework sorter when it's moving away from the corresponding role. Diffs (updated) - src/master/allocator/mesos/hierarchical.cpp 5ce9ceaa3a5f84a1e076d45448863c418531cc2b src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 Diff: https://reviews.apache.org/r/63831/diff/2/ Changes: https://reviews.apache.org/r/63831/diff/1-2/ Testing --- make check. Added a test. Thanks, Jiang Yan Xu
Re: Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.
> On Nov. 21, 2017, 10:27 a.m., Alexander Rukletsov wrote: > > src/master/allocator/mesos/hierarchical.cpp > > Lines 430-437 (original), 431-438 (patched) > > <https://reviews.apache.org/r/63831/diff/1/?file=1894169#file1894169line431> > > > > Do we need a lambda here? I kept the original use of lambdas. I think the benefit of labmdas here is more about readability: compared to ``` set removedRoles = oldRoles; foreach (const string& role, newRoles) { result.erase(role); } ``` The constness and construction of the variable is clearly isolated into a small block in a long method with many similar variables/code blocks. However as pointed out by the TODO, we should probably just implement a set difference operator so all these become one-liners. I'll do it later. > On Nov. 21, 2017, 10:27 a.m., Alexander Rukletsov wrote: > > src/tests/scheduler_tests.cpp > > Lines 1554-1556 (patched) > > <https://reviews.apache.org/r/63831/diff/1/?file=1894170#file1894170line1554> > > > > How about something like this: > > // This test verifies that if a framework decides to suppress its roles > > on reregisteration, no offers will be made. The main difference between `NoOffersOnReregistrationWithAllRolesSuppressed` and `NoOffersWithAllRolesSuppressed` is the initial condition "a framework initially with no roles suppressing offers". Although your suggestion implies it, I was emphasizing it. How about: ``` This test verifies that if a framework (initially with no roles suppressed) decides to suppress offers for its roles on reregisteration, no offers will be made. ``` It's probably simpler to read than my original version. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63831/#review191613 --- On Nov. 15, 2017, 11:05 p.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63831/ > --- > > (Updated Nov. 15, 2017, 11:05 p.m.) > > > Review request for mesos and Alexander Rukletsov. > > > Bugs: MESOS-8223 > https://issues.apache.org/jira/browse/MESOS-8223 > > > Repository: mesos > > > Description > --- > > In `updateFramework`, we currently treat frameworks suppressing a role the > same way as frameworks moving off a role and this could lead to the framework > being removed from the framework sorter, which is unexpected. We should only > remove the framework from a framework sorter when it's moving away from the > corresponding role. > > > Diffs > - > > src/master/allocator/mesos/hierarchical.cpp > f0fa7754e5968288edb15929efc9d244b177 > src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 > > > Diff: https://reviews.apache.org/r/63831/diff/1/ > > > Testing > --- > > make check. > > Added a test. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 63741: Fixed a bug in devolving framework subscription with suppressed roles.
> On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote: > > src/internal/devolve.cpp > > Lines 194-196 (original), 194-196 (patched) > > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line194> > > > > What about `evolve()`? > > Jiang Yan Xu wrote: > I suppose I need to handle it there too since `v1::scheduler::Call > evolve(const scheduler::Call& call)` is defined there but I have no idea > where this is ever used/useful? I added the custom converstion to `v1::scheduler::Call evolve(const scheduler::Call& call)` with a TODO. > On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote: > > src/internal/devolve.cpp > > Lines 199-200 (patched) > > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line199> > > > > Maybe a simple `if(){}` for now? Maybe the next special case will not > > be type dependent? SGTM. > On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote: > > src/internal/devolve.cpp > > Lines 203-204 (patched) > > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line203> > > > > I believe we prefer writing `CopyFrom()` explicitly, no? I don't think it's estalished as a rule even though we do use CopyFrom most often previously. Now that we have upgraded to protobuf 3.5 with move semantics, I think we can probably standarize on prefering the assignment operator as the choice between copy and move can be determined simply by the rvalue-ness of the operand unless copying is explicitly intended. I understand that this particular case wouldn't be moved but styling wise it's more futre-proof. We should further this discussion. For now I think this is OK as this style is already used in the codebase. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63741/#review191605 --- On Nov. 15, 2017, 10:41 a.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63741/ > --- > > (Updated Nov. 15, 2017, 10:41 a.m.) > > > Review request for mesos and Alexander Rukletsov. > > > Bugs: MESOS-8200 > https://issues.apache.org/jira/browse/MESOS-8200 > > > Repository: mesos > > > Description > --- > > Also improved logging to show the suppressed roles. > > > Diffs > - > > src/internal/devolve.cpp 3a02490d734ef4a558b448871f9a1054d832f457 > src/master/master.cpp 81192d5ce7c095495a5faa534d4792601e540e0e > > > Diff: https://reviews.apache.org/r/63741/diff/1/ > > > Testing > --- > > Tested together with /r/63830/. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 63741: Fixed a bug in devolving framework subscription with suppressed roles.
> On Nov. 21, 2017, 10:04 a.m., Alexander Rukletsov wrote: > > src/internal/devolve.cpp > > Lines 194-196 (original), 194-196 (patched) > > <https://reviews.apache.org/r/63741/diff/1/?file=1892879#file1892879line194> > > > > What about `evolve()`? I suppose I need to handle it there too since `v1::scheduler::Call evolve(const scheduler::Call& call)` is defined there but I have no idea where this is ever used/useful? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63741/#review191605 ------- On Nov. 15, 2017, 10:41 a.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63741/ > --- > > (Updated Nov. 15, 2017, 10:41 a.m.) > > > Review request for mesos and Alexander Rukletsov. > > > Bugs: MESOS-8200 > https://issues.apache.org/jira/browse/MESOS-8200 > > > Repository: mesos > > > Description > --- > > Also improved logging to show the suppressed roles. > > > Diffs > - > > src/internal/devolve.cpp 3a02490d734ef4a558b448871f9a1054d832f457 > src/master/master.cpp 81192d5ce7c095495a5faa534d4792601e540e0e > > > Diff: https://reviews.apache.org/r/63741/diff/1/ > > > Testing > --- > > Tested together with /r/63830/. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.
> On Nov. 21, 2017, 9:53 a.m., Alexander Rukletsov wrote: > > src/tests/scheduler_tests.cpp > > Lines 1515-1516 (original), 1534-1535 (patched) > > <https://reviews.apache.org/r/63830/diff/1/?file=1892884#file1892884line1542> > > > > `.WillRepeatedly(Return());` There's only one agent with no tasks launched so I don't expect additional offers? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63830/#review191602 --- On Nov. 21, 2017, 9:46 a.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63830/ > --- > > (Updated Nov. 21, 2017, 9:46 a.m.) > > > Review request for mesos and Alexander Rukletsov. > > > Bugs: MESOS-7996 > https://issues.apache.org/jira/browse/MESOS-7996 > > > Repository: mesos > > > Description > --- > > The previous iteration of the test failed to reveal the bug in MESOS-8200 due > to a race condition. > > > Diffs > - > > src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 > > > Diff: https://reviews.apache.org/r/63830/diff/1/ > > > Testing > --- > > As expected, this test would reliably fail without /r/63741/ with the > following > > ``` > I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED > received from http://17.121.128.13:38423/master/api/v1/scheduler > I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT > received from http://17.121.128.13:38423/master/api/v1/scheduler > I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework > 560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default) > I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS > received from http://17.121.128.13:38423/master/api/v1/scheduler > ../../src/tests/scheduler_tests.cpp:1497: Failure > Mock function called more times than expected - returning directly. > Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object > <50-98 0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 > 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>) > Expected: to be never called >Actual: called once - over-saturated and active > ``` > > The test passes with /r/63741/. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63830/ --- (Updated Nov. 21, 2017, 5:46 p.m.) Review request for mesos and Alexander Rukletsov. Bugs: MESOS-7996 https://issues.apache.org/jira/browse/MESOS-7996 Repository: mesos Description --- The previous iteration of the test failed to reveal the bug in MESOS-8200 due to a race condition. Diffs - src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 Diff: https://reviews.apache.org/r/63830/diff/1/ Testing --- As expected, this test would reliably fail without /r/63741/ with the following ``` I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED received from http://17.121.128.13:38423/master/api/v1/scheduler I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT received from http://17.121.128.13:38423/master/api/v1/scheduler I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework 560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default) I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS received from http://17.121.128.13:38423/master/api/v1/scheduler ../../src/tests/scheduler_tests.cpp:1497: Failure Mock function called more times than expected - returning directly. Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object <50-98 0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>) Expected: to be never called Actual: called once - over-saturated and active ``` The test passes with /r/63741/. Thanks, Jiang Yan Xu
Re: Review Request 63175: Do not generate UnavailableResources for inactive frameworks.
> On Nov. 7, 2017, 5:09 p.m., Benjamin Mahler wrote: > > src/master/allocator/mesos/hierarchical.hpp > > Lines 319 (patched) > > <https://reviews.apache.org/r/63175/diff/1/?file=1864532#file1864532line319> > > > > I think this should be `connected`? > > > > Right now there's some confusion around `active` and `connected`. > > `active` in the past meant that the framework was receiving offers, but we > > never added the ability for frameworks to change this state. When we added > > suppression, that was equivalent to deactivation (note that activation has > > become a per-role thing). > > > > Mostly `connected` == `active`. But I think the point of this boolean > > is to track whether we can talk to the framework? Thoughts? > > Jiang Yan Xu wrote: > `active` is already used in the class and this review doesn't change its > meaning so I think we can defer this for later? > > For activeness vs. connectedness, I feel it's a larger effort to address > it consistently. I'll follow up on this. And we'd better keep it consistent between master and allocator, [agents](https://github.com/apache/mesos/blob/94ef10c2d4572cbe8c70b40fe3c59524186d6236/src/master/master.hpp#L196-L202) and frameworks, method names ([e.g.](https://github.com/apache/mesos/blob/94ef10c2d4572cbe8c70b40fe3c59524186d6236/src/master/allocator/mesos/hierarchical.hpp#L123)) and variables Right? - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63175/#review190412 --- On Nov. 7, 2017, 5:02 p.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63175/ > --- > > (Updated Nov. 7, 2017, 5:02 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Joseph Wu. > > > Bugs: MESOS-8085 > https://issues.apache.org/jira/browse/MESOS-8085 > > > Repository: mesos > > > Description > --- > > Do not generate UnavailableResources for inactive frameworks. > > > Diffs > - > > src/master/allocator/mesos/hierarchical.hpp > 2f7755afd056455ce6bddfbe2619a359f41d8e6a > src/master/allocator/mesos/hierarchical.cpp > f0fa7754e5968288edb15929efc9d244b177 > > > Diff: https://reviews.apache.org/r/63175/diff/2/ > > > Testing > --- > > make check. > > I didn' write a new test as the externally observable behavior doesn't change. > > > Thanks, > > Jiang Yan Xu > >
Re: Review Request 63175: Do not generate UnavailableResources for inactive frameworks.
> On Nov. 7, 2017, 5:09 p.m., Benjamin Mahler wrote: > > src/master/allocator/mesos/hierarchical.hpp > > Lines 319 (patched) > > <https://reviews.apache.org/r/63175/diff/1/?file=1864532#file1864532line319> > > > > I think this should be `connected`? > > > > Right now there's some confusion around `active` and `connected`. > > `active` in the past meant that the framework was receiving offers, but we > > never added the ability for frameworks to change this state. When we added > > suppression, that was equivalent to deactivation (note that activation has > > become a per-role thing). > > > > Mostly `connected` == `active`. But I think the point of this boolean > > is to track whether we can talk to the framework? Thoughts? `active` is already used in the class and this review doesn't change its meaning so I think we can defer this for later? For activeness vs. connectedness, I feel it's a larger effort to address it consistently. I'll follow up on this. - Jiang Yan --- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63175/#review190412 ----------- On Nov. 7, 2017, 5:02 p.m., Jiang Yan Xu wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63175/ > --- > > (Updated Nov. 7, 2017, 5:02 p.m.) > > > Review request for mesos, Benjamin Mahler, James Peach, and Joseph Wu. > > > Bugs: MESOS-8085 > https://issues.apache.org/jira/browse/MESOS-8085 > > > Repository: mesos > > > Description > --- > > Do not generate UnavailableResources for inactive frameworks. > > > Diffs > - > > src/master/allocator/mesos/hierarchical.hpp > c2346054b2c98516f15ab8ce2dc798224ff4def4 > src/master/allocator/mesos/hierarchical.cpp > 5b6efe5faa3c3b10f1f714f582a155b368f8ccaf > > > Diff: https://reviews.apache.org/r/63175/diff/1/ > > > Testing > --- > > make check. > > I didn' write a new test as the externally observable behavior doesn't change. > > > Thanks, > > Jiang Yan Xu > >
Review Request 63900: Downgraded the logging level of socket shutdown failures.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63900/ --- Review request for mesos, Joseph Wu and Zhitao Li. Repository: mesos Description --- It often merely shows that a connection is closed before libprocess shuts it down so it should be logged at INFO level. Diffs - 3rdparty/libprocess/src/process.cpp 732aff86bb6f4c56d4eec7107b2cc2aff139c06a Diff: https://reviews.apache.org/r/63900/diff/1/ Testing --- make check. Thanks, Jiang Yan Xu
Review Request 63895: Fixed flaky test UnreachableAgentReregisterAfterFailover test.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63895/ --- Review request for mesos, Alexander Rukletsov and James Peach. Bugs: MESOS-8178 https://issues.apache.org/jira/browse/MESOS-8178 Repository: mesos Description --- Other tests in this file could probably use the same treatment which I can do later. Diffs - src/tests/slave_tests.cpp a75bb260df223b5b86f31e91eec1b6ba8db00cb2 Diff: https://reviews.apache.org/r/63895/diff/1/ Testing --- make check. I couldn't repro the failure locally but I think the fix makes sense logically so it probably is the fix. ;) Thanks, Jiang Yan Xu
Review Request 63831: Fixed a bug that removed the suppressed framework from sorter.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63831/ --- Review request for mesos and Alexander Rukletsov. Bugs: MESOS-8223 https://issues.apache.org/jira/browse/MESOS-8223 Repository: mesos Description --- In `updateFramework`, we currently treat frameworks suppressing a role the same way as frameworks moving off a role and this could lead to the framework being removed from the framework sorter, which is unexpected. We should only remove the framework from a framework sorter when it's moving away from the corresponding role. Diffs - src/master/allocator/mesos/hierarchical.cpp f0fa7754e5968288edb15929efc9d244b177 src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 Diff: https://reviews.apache.org/r/63831/diff/1/ Testing --- make check. Added a test. Thanks, Jiang Yan Xu
Review Request 63830: Fixed 'NoOffersWithAllRolesSuppressed' test.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63830/ --- Review request for mesos and Alexander Rukletsov. Repository: mesos Description --- The previous iteration of the test failed to reveal the bug in MESOS-8200 due to a race condition. Diffs - src/tests/scheduler_tests.cpp 45fc9c0cfccdb22c2e3e8d5de30c04575814a0e9 Diff: https://reviews.apache.org/r/63830/diff/1/ Testing --- As expected, this test would reliably fail without /r/63741/ with the following ``` I1115 07:48:28.536093 3117031 scheduler.cpp:741] Enqueuing event SUBSCRIBED received from http://17.121.128.13:38423/master/api/v1/scheduler I1115 07:48:28.536224 3117031 scheduler.cpp:741] Enqueuing event HEARTBEAT received from http://17.121.128.13:38423/master/api/v1/scheduler I1115 07:48:28.536244 3117029 master.cpp:8115] Sending 1 offers to framework 560bd0e1-8ab7-4e7e-be7a-54cb5f399531- (default) I1115 07:48:28.536650 3117029 scheduler.cpp:741] Enqueuing event OFFERS received from http://17.121.128.13:38423/master/api/v1/scheduler ../../src/tests/scheduler_tests.cpp:1497: Failure Mock function called more times than expected - returning directly. Function call: offers(0x7fffabd197a0, @0x7f7d8c005460 48-byte object <50-98 0B-B0 7D-7F 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 01-00 00-00 04-00 00-00 A0-54 00-8C 7D-7F 00-00>) Expected: to be never called Actual: called once - over-saturated and active ``` The test passes with /r/63741/. Thanks, Jiang Yan Xu
Review Request 63741: Fixed framework subscription with suppressed roles.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63741/ --- Review request for mesos and Alexander Rukletsov. Repository: mesos Description --- Also improved logging to show the suppressed roles. Diffs - src/internal/devolve.cpp 3a02490d734ef4a558b448871f9a1054d832f457 src/master/master.cpp 81192d5ce7c095495a5faa534d4792601e540e0e Diff: https://reviews.apache.org/r/63741/diff/1/ Testing --- Tested together with /r/63830/. Thanks, Jiang Yan Xu
Re: Review Request 63642: Added a test for ExecutorID validation in ReregisterSlaveMessage.
--- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/63642/#review190371 --- src/tests/master_validation_tests.cpp Lines 4215 (patched) <https://reviews.apache.org/r/63642/#comment267704> 2 space indentation. src/tests/master_validation_tests.cpp Lines 4225-4226 (patched) <https://reviews.apache.org/r/63642/#comment267705> Nit: I feel the grouping looks clearer to have: ``` frameworks.push_back(DEFAULT_FRAMEWORK_INFO); frameworks[0].set_name("framework1"); frameworks[0].mutable_id()->set_value("framework1"); frameworks.push_back(DEFAULT_FRAMEWORK_INFO); frameworks[1].set_name("framework2"); frameworks[1].mutable_id()->set_value("framework2"); ``` BTW you could use `back()` as well. Same for the executors. Up to you. src/tests/master_validation_tests.cpp Lines 4241-4242 (patched) <https://reviews.apache.org/r/63642/#comment267726> These are `ASSERT_*`s? Same below. src/tests/master_validation_tests.cpp Lines 4243 (patched) <https://reviews.apache.org/r/63642/#comment267727> These are `EXPECT_*`? Same below. - Jiang Yan Xu On Nov. 7, 2017, 10:47 a.m., James Peach wrote: > > --- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/63642/ > --- > > (Updated Nov. 7, 2017, 10:47 a.m.) > > > Review request for mesos, James DeFelice and Jiang Yan Xu. > > > Bugs: MESOS-8169 > https://issues.apache.org/jira/browse/MESOS-8169 > > > Repository: mesos > > > Description > --- > > Added a test to ensure that the ReregisterSlaveMessage validation > correctly allows duplicate ExecutorIDs as long as they are scoped > to different frameworks. > > > Diffs > - > > src/tests/master_validation_tests.cpp > be8c8920fdd6eb72fc77247844a731070db5fa1c > > > Diff: https://reviews.apache.org/r/63642/diff/3/ > > > Testing > --- > > make check (Fedora 26) with PR #248 applied. > > > Thanks, > > James Peach > >