Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-22 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218780
---



Patch looks great!

Reviews applied: [71742]

Passed command: export OS='ubuntu:14.04' BUILDTOOL='autotools' COMPILER='gcc' 
CONFIGURATION='--verbose --disable-libtool-wrappers 
--disable-parallel-test-execution' ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; 
./support/docker-build.sh

- Mesos Reviewbot


On Nov. 22, 2019, 8:02 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71742/
> ---
> 
> (Updated Nov. 22, 2019, 8:02 p.m.)
> 
> 
> Review request for mesos, Benjamin Mahler, Greg Mann, and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-10032
> https://issues.apache.org/jira/browse/MESOS-10032
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In the case agents lost ZooKeeper connections and resetting its
> master to none and beginning to dropping control messages from the
> master, agent should not respond pings from master.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 
>   src/tests/slave_recovery_tests.cpp 0efd3a6ac09ad06d9365b7bb2295157b5175e6b8 
> 
> 
> Diff: https://reviews.apache.org/r/71742/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>



Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-22 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218775
---



As we are discussing on slack how to deal with the broken test and how to 
change it to be more meaningful I think it won't be unreasonable to just delete 
it for now. We can always add it in its new form back.

- Jiang Yan Xu


On Nov. 22, 2019, noon, Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71742/
> ---
> 
> (Updated Nov. 22, 2019, noon)
> 
> 
> Review request for mesos, Benjamin Mahler and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-10032
> https://issues.apache.org/jira/browse/MESOS-10032
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In the case agents lost ZooKeeper connections and resetting its
> master to none and beginning to dropping control messages from the
> master, agent should not respond pings from master.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 
>   src/tests/slave_recovery_tests.cpp 0efd3a6ac09ad06d9365b7bb2295157b5175e6b8 
> 
> 
> Diff: https://reviews.apache.org/r/71742/diff/4/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>



Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-22 Thread Xudong Ni via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/
---

(Updated Nov. 22, 2019, 8 p.m.)


Review request for mesos, Benjamin Mahler and Jiang Yan Xu.


Changes
---

remove the test no longer applicable in the new scenario


Bugs: MESOS-10032
https://issues.apache.org/jira/browse/MESOS-10032


Repository: mesos


Description
---

In the case agents lost ZooKeeper connections and resetting its
master to none and beginning to dropping control messages from the
master, agent should not respond pings from master.


Diffs (updated)
-

  src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 
  src/tests/slave_recovery_tests.cpp 0efd3a6ac09ad06d9365b7bb2295157b5175e6b8 


Diff: https://reviews.apache.org/r/71742/diff/4/

Changes: https://reviews.apache.org/r/71742/diff/3-4/


Testing (updated)
---

make check


Thanks,

Xudong Ni



Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-19 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218713
---



Bad patch!

Reviews applied: [71742]

Failed command: ['bash', '-c', "set -o pipefail; export OS='ubuntu:14.04' 
BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose 
--disable-libtool-wrappers --disable-parallel-test-execution' 
ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh 2>&1 | tee 
build_71742"]

Error:
..
principal"},"source":{"id":"/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_3yGpKu/2GB-da412f9a-c645-426e-830f-36efccaf33ab","mount":{"root":"./csi/org.apache.mesos.csi.test/local/mounts"},"profile":"test","type":"MOUNT","vendor":"org.apache.mesos.csi.test.local"},"volume":{"container_path":"volume","mode":"RW"}},"name":"disk","provider_id":{"value":"4238df44-e758-45fa-a640-2b50ff3f9d9b"},"reservations":[{"role":"storage","type":"DYNAMIC"},{"principal":"test-principal","role":"storage/default-role","type":"DYNAMIC"}],"scalar":{"value":2048.0},"type":"SCALAR"}]'
I1120 06:06:59.865097 18662 sched.cpp:960] Rescinded offer 
a9e4592f-64ab-4112-9e88-7566facc2bed-O3
I1120 06:06:59.865196 18662 sched.cpp:971] Scheduler::offerRescinded took 
40528ns
I1120 06:06:59.865566 18648 hierarchical.cpp:1576] Recovered ports(allocated: 
storage/default-role):[31000-32000]; disk(allocated: 
storage/default-role)(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_3yGpKu/2GB-da412f9a-c645-426e-830f-36efccaf33ab,test)]:2048;
 cpus(allocated: storage/default-role):2; mem(allocated: 
storage/default-role):1024; disk(allocated: storage/default-role):1024 (total: 
cpus:2; mem:1024; disk:1024; ports:[31000-32000]; disk(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_3yGpKu/2GB-da412f9a-c645-426e-830f-36efccaf33ab,test)]:2048,
 offered or allocated: {}) on agent a9e4592f-64ab-4112-9e88-7566facc2bed-S0 
from framework a9e45
 92f-64ab-4112-9e88-7566facc2bed-
I1120 06:06:59.865661 18670 master.cpp:12794] Removing offer 
a9e4592f-64ab-4112-9e88-7566facc2bed-O3
I1120 06:06:59.866827 18648 hierarchical.cpp:1625] Framework 
a9e4592f-64ab-4112-9e88-7566facc2bed- filtered agent 
a9e4592f-64ab-4112-9e88-7566facc2bed-S0 for 5secs
I1120 06:06:59.869273 18651 master.cpp:12659] Sending operation '' (uuid: 
432806fd-224b-4139-aa48-1abc35ccafeb) to agent 
a9e4592f-64ab-4112-9e88-7566facc2bed-S0 at slave(1250)@172.17.0.2:42302 
(7ccc8b834dd0)
I1120 06:06:59.869983 18649 slave.cpp:4352] Ignoring new checkpointed resources 
and operations identical to the current version
I1120 06:06:59.873303 18671 provider.cpp:498] Received APPLY_OPERATION event
I1120 06:06:59.873347 18671 provider.cpp:1351] Received CREATE operation '' 
(uuid: 432806fd-224b-4139-aa48-1abc35ccafeb)
I1120 06:06:59.879556 18657 master.cpp:6482] Processing REVIVE call for 
framework a9e4592f-64ab-4112-9e88-7566facc2bed- (default) at 
scheduler-5eaf4119-5232-42ab-89d3-bcfa3995da6d@172.17.0.2:42302
I1120 06:06:59.880115 18657 hierarchical.cpp:1721] Unsuppressed offers and 
cleared filters for roles { storage/default-role } of framework 
a9e4592f-64ab-4112-9e88-7566facc2bed-
I1120 06:06:59.881371 18657 hierarchical.cpp:1853] Performed allocation for 1 
agents in 1.047256ms
I1120 06:06:59.881721 18657 hierarchical.cpp:1853] Performed allocation for 1 
agents in 151686ns
I1120 06:06:59.881968 18670 master.cpp:10497] Sending offers [ 
a9e4592f-64ab-4112-9e88-7566facc2bed-O4 ] to framework 
a9e4592f-64ab-4112-9e88-7566facc2bed- (default) at 
scheduler-5eaf4119-5232-42ab-89d3-bcfa3995da6d@172.17.0.2:42302
I1120 06:06:59.882639 18655 sched.cpp:934] Scheduler::resourceOffers took 
101940ns
I1120 06:06:59.893764 18661 http.cpp:1115] HTTP POST for 
/slave(1250)/api/v1/resource_provider from 172.17.0.2:44786
I1120 06:06:59.894723 18649 slave.cpp:8495] Handling resource provider message 
'UPDATE_OPERATION_STATUS: (uuid: 886bf308-26b8-4c87-a3d8-aaf636839fab) for 
framework  (latest state: OPERATION_FINISHED, status update state: 
OPERATION_FINISHED)'
I1120 06:06:59.894937 18649 slave.cpp:8948] Updating the state of operation 
with no ID (uuid: 886bf308-26b8-4c87-a3d8-aaf636839fab) for an operation API 
call (latest state: OPERATION_FINISHED, status update state: OPERATION_FINISHED)
I1120 06:06:59.894990 18649 slave.cpp:8702] Forwarding status update of 
operation with no ID (operation_uuid: 886bf308-26b8-4c87-a3d8-aaf636839fab) for 
an operator API call
I1120 06:06:59.895336 18664 master.cpp:12311] Updating the state of operation 

Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-19 Thread Xudong Ni via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/
---

(Updated Nov. 19, 2019, 10:39 p.m.)


Review request for mesos, Benjamin Mahler and Jiang Yan Xu.


Changes
---

Addressed the comments except the test which will be updated later


Bugs: MESOS-10032
https://issues.apache.org/jira/browse/MESOS-10032


Repository: mesos


Description
---

In the case agents lost ZooKeeper connections and resetting its
master to none and beginning to dropping control messages from the
master, agent should not respond pings from master.


Diffs (updated)
-

  src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 


Diff: https://reviews.apache.org/r/71742/diff/3/

Changes: https://reviews.apache.org/r/71742/diff/2-3/


Testing
---

==] 2322 tests from 222 test cases ran. (1038166 ms total)
[  PASSED  ] 2321 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery, where TypeParam = 
mesos::internal::slave::MesosContainerizer

This failed test verifies that the agent responds to pings from the master 
while the agent is performing recovery, this PR will break this scenario.


Thanks,

Xudong Ni



Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-15 Thread Jiang Yan Xu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218653
---



Chatted on https://mesos.slack.com/archives/C8PDVCDE3/p1573688798006800.

The approach LGTM. Let's fix the tests. :)


src/slave/slave.cpp
Lines 6415 (patched)


s/respond/respond to/

But the comment is basically reiterating the behavior which is pretty 
straightforward. We can add a bit of rationale for the behavior here: 

```
// Don't respond to pings if the agent cannot detect the master (e.g., due 
to failing to connect to ZK) because it stops responding to control messages 
such as run/kill tasks. It's better to have the master eventually mark the 
agent as partitioned after prolonged ZK disconnection than to silently drop 
messages.
```



src/slave/slave.cpp
Lines 6419 (patched)


"registered master" is not a valid concept here, let's say "because the 
master cannot be detected"?


- Jiang Yan Xu


On Nov. 13, 2019, 2:39 p.m., Xudong Ni wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71742/
> ---
> 
> (Updated Nov. 13, 2019, 2:39 p.m.)
> 
> 
> Review request for mesos and Jiang Yan Xu.
> 
> 
> Bugs: MESOS-10032
> https://issues.apache.org/jira/browse/MESOS-10032
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> In the case agents lost ZooKeeper connections and resetting its
> master to none and beginning to dropping control messages from the
> master, agent should not respond pings from master.
> 
> 
> Diffs
> -
> 
>   src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 
> 
> 
> Diff: https://reviews.apache.org/r/71742/diff/2/
> 
> 
> Testing
> ---
> 
> ==] 2322 tests from 222 test cases ran. (1038166 ms total)
> [  PASSED  ] 2321 tests.
> [  FAILED  ] 1 test, listed below:
> [  FAILED  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery, where TypeParam = 
> mesos::internal::slave::MesosContainerizer
> 
> This failed test verifies that the agent responds to pings from the master 
> while the agent is performing recovery, this PR will break this scenario.
> 
> 
> Thanks,
> 
> Xudong Ni
> 
>



Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-13 Thread Xudong Ni via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/
---

(Updated Nov. 13, 2019, 10:39 p.m.)


Review request for mesos and Jiang Yan Xu.


Bugs: MESOS-10032
https://issues.apache.org/jira/browse/MESOS-10032


Repository: mesos


Description
---

In the case agents lost ZooKeeper connections and resetting its
master to none and beginning to dropping control messages from the
master, agent should not respond pings from master.


Diffs
-

  src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 


Diff: https://reviews.apache.org/r/71742/diff/2/


Testing (updated)
---

==] 2322 tests from 222 test cases ran. (1038166 ms total)
[  PASSED  ] 2321 tests.
[  FAILED  ] 1 test, listed below:
[  FAILED  ] SlaveRecoveryTest/0.PingTimeoutDuringRecovery, where TypeParam = 
mesos::internal::slave::MesosContainerizer

This failed test verifies that the agent responds to pings from the master 
while the agent is performing recovery, this PR will break this scenario.


Thanks,

Xudong Ni



Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-13 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/#review218616
---



Bad patch!

Reviews applied: [71742]

Failed command: ['bash', '-c', "set -o pipefail; export OS='ubuntu:14.04' 
BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose 
--disable-libtool-wrappers --disable-parallel-test-execution' 
ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh 2>&1 | tee 
build_71742"]

Error:
..
b4d240c68- (default) at 
scheduler-eb169004-2da5-4544-af21-45d8770cf0be@172.17.0.2:43725
I1113 08:55:24.192509 18861 sched.cpp:934] Scheduler::resourceOffers took 
66119ns
I1113 08:55:24.195724 18868 process.cpp:3671] Handling HTTP event for process 
'master' with path: '/master/api/v1'
I1113 08:55:24.197166 18869 http.cpp:1115] HTTP POST for /master/api/v1 from 
172.17.0.2:58524
I1113 08:55:24.197444 18869 http.cpp:263] Processing call CREATE_VOLUMES
I1113 08:55:24.198328 18869 master.cpp:3982] Authorizing principal 
'test-principal' to create volumes 
'[{"disk":{"persistence":{"id":"1110bef5-ec74-4476-b2a6-fa12de70281b","principal":"test-principal"},"source":{"id":"/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_NrTXk1/2GB-580a2ec3-f46e-4118-af89-66e1d0a0b173","mount":{"root":"./csi/org.apache.mesos.csi.test/local/mounts"},"profile":"test","type":"MOUNT","vendor":"org.apache.mesos.csi.test.local"},"volume":{"container_path":"volume","mode":"RW"}},"name":"disk","provider_id":{"value":"875458c8-52f4-4f30-a00c-2f8c583e4a4d"},"reservations":[{"role":"storage","type":"DYNAMIC"},{"principal":"test-principal","role":"storage/default-role","type":"DYNAMIC"}],"scalar":{"value":2048.0},"type":"SCALAR"}]'
I1113 08:55:24.199939 18863 sched.cpp:960] Rescinded offer 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68-O3
I1113 08:55:24.200021 18863 sched.cpp:971] Scheduler::offerRescinded took 
29618ns
I1113 08:55:24.200551 18861 hierarchical.cpp:1576] Recovered ports(allocated: 
storage/default-role):[31000-32000]; disk(allocated: 
storage/default-role)(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_NrTXk1/2GB-580a2ec3-f46e-4118-af89-66e1d0a0b173,test)]:2048;
 cpus(allocated: storage/default-role):2; mem(allocated: 
storage/default-role):1024; disk(allocated: storage/default-role):1024 (total: 
cpus:2; mem:1024; disk:1024; ports:[31000-32000]; disk(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_NrTXk1/2GB-580a2ec3-f46e-4118-af89-66e1d0a0b173,test)]:2048,
 offered or allocated: {}) on agent f31dc3e2-d030-4fe5-b8f8-324b4d240c68-S0 
from framework f31dc
 3e2-d030-4fe5-b8f8-324b4d240c68-
I1113 08:55:24.200685 18867 master.cpp:12794] Removing offer 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68-O3
I1113 08:55:24.202123 18861 hierarchical.cpp:1625] Framework 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68- filtered agent 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68-S0 for 5secs
I1113 08:55:24.204321 18865 master.cpp:12659] Sending operation '' (uuid: 
d5f2f136-ec48-42af-8ac8-89485110e831) to agent 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68-S0 at slave(1250)@172.17.0.2:43725 
(2c8a034ce049)
I1113 08:55:24.204887 18866 slave.cpp:4352] Ignoring new checkpointed resources 
and operations identical to the current version
I1113 08:55:24.207716 18872 provider.cpp:498] Received APPLY_OPERATION event
I1113 08:55:24.207767 18872 provider.cpp:1351] Received CREATE operation '' 
(uuid: d5f2f136-ec48-42af-8ac8-89485110e831)
I1113 08:55:24.213966 18865 master.cpp:6482] Processing REVIVE call for 
framework f31dc3e2-d030-4fe5-b8f8-324b4d240c68- (default) at 
scheduler-eb169004-2da5-4544-af21-45d8770cf0be@172.17.0.2:43725
I1113 08:55:24.214371 18865 hierarchical.cpp:1721] Unsuppressed offers and 
cleared filters for roles { storage/default-role } of framework 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68-
I1113 08:55:24.215639 18865 hierarchical.cpp:1853] Performed allocation for 1 
agents in 1.124093ms
I1113 08:55:24.216162 18865 hierarchical.cpp:1853] Performed allocation for 1 
agents in 247643ns
I1113 08:55:24.216593 18873 master.cpp:10497] Sending offers [ 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68-O4 ] to framework 
f31dc3e2-d030-4fe5-b8f8-324b4d240c68- (default) at 
scheduler-eb169004-2da5-4544-af21-45d8770cf0be@172.17.0.2:43725
I1113 08:55:24.217276 18864 sched.cpp:934] Scheduler::resourceOffers took 
86870ns
I1113 08:55:24.232483 18871 http.cpp:1115] HTTP POST for 
/slave(1250)/api/v1/resource_provider from 172.17.0.2:58518
I1113 08:55:24.233323 18860 slave.cpp:8491] Handling resource provider message 

Re: Review Request 71742: Mesos agent shouldn't respond pings if no master is registered

2019-11-12 Thread Xudong Ni via Review Board

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71742/
---

(Updated Nov. 13, 2019, 12:18 a.m.)


Review request for mesos and Jiang Yan Xu.


Summary (updated)
-

Mesos agent shouldn't respond pings if no master is registered


Bugs: MESOS-10032
https://issues.apache.org/jira/browse/MESOS-10032


Repository: mesos


Description (updated)
---

In the case agents lost ZooKeeper connections and resetting its
master to none and beginning to dropping control messages from the
master, agent should not respond pings from master.


Diffs (updated)
-

  src/slave/slave.cpp 3839a120446339fea8aa857f431a2dba28ed4002 


Diff: https://reviews.apache.org/r/71742/diff/2/

Changes: https://reviews.apache.org/r/71742/diff/1-2/


Testing (updated)
---

WIP


Thanks,

Xudong Ni