Re: Review Request 71275: Fixed agent draining when returning from unreachable state.

2019-08-13 Thread Joseph Wu


> On Aug. 13, 2019, 4:11 p.m., Greg Mann wrote:
> > src/tests/api_tests.cpp
> > Lines 5953 (patched)
> > 
> >
> > Is this necessary?

Oh, good catch.  This line doesn't add anything, since we don't expect any 
further updates in this sequence.

I shall delete the line.


- Joseph


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71275/#review217187
---


On Aug. 13, 2019, 2:13 p.m., Joseph Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71275/
> ---
> 
> (Updated Aug. 13, 2019, 2:13 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benno Evers, and Greg Mann.
> 
> 
> Bugs: MESOS-9934
> https://issues.apache.org/jira/browse/MESOS-9934
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This logic was missing from the initial implementation of agent
> draining.  When an agent became unreachable, and then reregistered
> with the master, the master would not properly deactivate or drain
> the agent.
> 
> This also fixes a potential problem with checking the agent
> drain state too early in the case of pending operations.
> Operations are not reported to the master until after the agent
> reregisters, so agents with the RESOURCE_PROVIDER capability
> cannot be considered DRAINED until after the first 
> UpdateSlaveMessage has arrived.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 31c7c97abecb92591369b417e2ef38a99cc09bac 
>   src/tests/api_tests.cpp c2099674e742eaa08134c5e0a7cdab1734808119 
> 
> 
> Diff: https://reviews.apache.org/r/71275/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>



Re: Review Request 71275: Fixed agent draining when returning from unreachable state.

2019-08-13 Thread Greg Mann

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71275/#review217187
---


Fix it, then Ship it!





src/tests/api_tests.cpp
Lines 5953 (patched)


Is this necessary?


- Greg Mann


On Aug. 13, 2019, 9:13 p.m., Joseph Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71275/
> ---
> 
> (Updated Aug. 13, 2019, 9:13 p.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benno Evers, and Greg Mann.
> 
> 
> Bugs: MESOS-9934
> https://issues.apache.org/jira/browse/MESOS-9934
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This logic was missing from the initial implementation of agent
> draining.  When an agent became unreachable, and then reregistered
> with the master, the master would not properly deactivate or drain
> the agent.
> 
> This also fixes a potential problem with checking the agent
> drain state too early in the case of pending operations.
> Operations are not reported to the master until after the agent
> reregisters, so agents with the RESOURCE_PROVIDER capability
> cannot be considered DRAINED until after the first 
> UpdateSlaveMessage has arrived.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 31c7c97abecb92591369b417e2ef38a99cc09bac 
>   src/tests/api_tests.cpp c2099674e742eaa08134c5e0a7cdab1734808119 
> 
> 
> Diff: https://reviews.apache.org/r/71275/diff/2/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>



Re: Review Request 71275: Fixed agent draining when returning from unreachable state.

2019-08-13 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71275/
---

(Updated Aug. 13, 2019, 2:13 p.m.)


Review request for mesos, Benjamin Bannier, Benno Evers, and Greg Mann.


Changes
---

Fixed the problem with marking stuff drained too early if operations exist.
Fixed race in unit test.


Bugs: MESOS-9934
https://issues.apache.org/jira/browse/MESOS-9934


Repository: mesos


Description (updated)
---

This logic was missing from the initial implementation of agent
draining.  When an agent became unreachable, and then reregistered
with the master, the master would not properly deactivate or drain
the agent.

This also fixes a potential problem with checking the agent
drain state too early in the case of pending operations.
Operations are not reported to the master until after the agent
reregisters, so agents with the RESOURCE_PROVIDER capability
cannot be considered DRAINED until after the first 
UpdateSlaveMessage has arrived.


Diffs (updated)
-

  src/master/master.cpp 31c7c97abecb92591369b417e2ef38a99cc09bac 
  src/tests/api_tests.cpp c2099674e742eaa08134c5e0a7cdab1734808119 


Diff: https://reviews.apache.org/r/71275/diff/2/

Changes: https://reviews.apache.org/r/71275/diff/1-2/


Testing
---

make check


Thanks,

Joseph Wu



Re: Review Request 71275: Fixed agent draining when returning from unreachable state.

2019-08-13 Thread Greg Mann


> On Aug. 13, 2019, 1:10 p.m., Benno Evers wrote:
> > src/master/master.cpp
> > Lines 7959 (patched)
> > 
> >
> > Can we already decide at this point if the agent is drained? E.g. 
> > `checkAndTransitionDrainingAgent()` seems to look at pending operations, 
> > which only get sent in the first `UpdateSlaveMessage` after the 
> > reregistration is completed.

Good catch Benno - this is also an issue with the existing reregistration code 
in `___reregisterSlave()`. Unfortunately it's even more annoying than waiting 
for the first `UpdateSlaveMessage`, since it's possible that resource providers 
may subscribe with the agent later on. Not yet sure what the best way to handle 
this is.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71275/#review217177
---


On Aug. 13, 2019, 2:35 a.m., Joseph Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71275/
> ---
> 
> (Updated Aug. 13, 2019, 2:35 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benno Evers, and Greg Mann.
> 
> 
> Bugs: MESOS-9934
> https://issues.apache.org/jira/browse/MESOS-9934
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This logic was missing from the initial implementation of agent
> draining.  When an agent became unreachable, and then reregistered
> with the master, the master would not properly deactivate or drain
> the agent.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 31c7c97abecb92591369b417e2ef38a99cc09bac 
>   src/tests/api_tests.cpp c2099674e742eaa08134c5e0a7cdab1734808119 
> 
> 
> Diff: https://reviews.apache.org/r/71275/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>



Re: Review Request 71275: Fixed agent draining when returning from unreachable state.

2019-08-13 Thread Benno Evers

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71275/#review217177
---




src/master/master.cpp
Lines 7959 (patched)


Can we already decide at this point if the agent is drained? E.g. 
`checkAndTransitionDrainingAgent()` seems to look at pending operations, which 
only get sent in the first `UpdateSlaveMessage` after the reregistration is 
completed.


- Benno Evers


On Aug. 13, 2019, 2:35 a.m., Joseph Wu wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/71275/
> ---
> 
> (Updated Aug. 13, 2019, 2:35 a.m.)
> 
> 
> Review request for mesos, Benjamin Bannier, Benno Evers, and Greg Mann.
> 
> 
> Bugs: MESOS-9934
> https://issues.apache.org/jira/browse/MESOS-9934
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> This logic was missing from the initial implementation of agent
> draining.  When an agent became unreachable, and then reregistered
> with the master, the master would not properly deactivate or drain
> the agent.
> 
> 
> Diffs
> -
> 
>   src/master/master.cpp 31c7c97abecb92591369b417e2ef38a99cc09bac 
>   src/tests/api_tests.cpp c2099674e742eaa08134c5e0a7cdab1734808119 
> 
> 
> Diff: https://reviews.apache.org/r/71275/diff/1/
> 
> 
> Testing
> ---
> 
> make check
> 
> 
> Thanks,
> 
> Joseph Wu
> 
>



Re: Review Request 71275: Fixed agent draining when returning from unreachable state.

2019-08-12 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71275/#review217167
---



Bad patch!

Reviews applied: [71275]

Failed command: ['bash', '-c', "set -o pipefail; export OS='ubuntu:14.04' 
BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose 
--disable-libtool-wrappers --disable-parallel-test-execution' 
ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/docker-build.sh 2>&1 | tee 
build_71275"]

Error:
..
ent 83a2b4de-3aa8-40cf-a9bf-5c17535ba026-S0
I0813 04:34:17.243963 18630 status_update_manager_process.hpp:528] Forwarding 
operation status update OPERATION_FINISHED (Status UUID: 
670785ed-72f2-4489-8fb7-18158c3e9339) for operation UUID 
b211a4e0-d0da-458e-acf5-064b21305ef2 on agent 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026-S0
I0813 04:34:17.244907 18625 http_connection.hpp:131] Sending 2 call to 
http://172.17.0.2:39253/slave(1222)/api/v1/resource_provider
I0813 04:34:17.246160 18636 process.cpp:3671] Handling HTTP event for process 
'slave(1222)' with path: '/slave(1222)/api/v1/resource_provider'
I0813 04:34:17.249538 18631 process.cpp:3671] Handling HTTP event for process 
'master' with path: '/master/api/v1'
I0813 04:34:17.251186 18634 http.cpp:1115] HTTP POST for /master/api/v1 from 
172.17.0.2:57282
I0813 04:34:17.251458 18634 http.cpp:263] Processing call UNRESERVE_RESOURCES
I0813 04:34:17.252104 18634 master.cpp:3888] Authorizing principal 
'test-principal' to unreserve resources 
'[{"disk":{"source":{"id":"/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v0_sSnqSg/2GB-8c3d0180-f55f-4527-af97-b8a1990fabdd","mount":{"root":"./csi/org.apache.mesos.csi.test/local/mounts"},"profile":"test","type":"MOUNT","vendor":"org.apache.mesos.csi.test.local"}},"name":"disk","provider_id":{"value":"b814e382-322a-4e6a-b9ec-69384e0b7818"},"reservations":[{"role":"storage","type":"DYNAMIC"},{"principal":"test-principal","role":"storage/default-role","type":"DYNAMIC"}],"scalar":{"value":2048.0},"type":"SCALAR"}]'
I0813 04:34:17.254338 18647 master.cpp:12706] Removing offer 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026-O5
I0813 04:34:17.254495 18637 sched.cpp:960] Rescinded offer 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026-O5
I0813 04:34:17.254575 18637 sched.cpp:971] Scheduler::offerRescinded took 
23100ns
I0813 04:34:17.255210 18632 hierarchical.cpp:1218] Recovered disk(allocated: 
storage/default-role)(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v0_sSnqSg/2GB-8c3d0180-f55f-4527-af97-b8a1990fabdd,test)]:2048;
 cpus(allocated: storage/default-role):2; mem(allocated: 
storage/default-role):1024; disk(allocated: storage/default-role):1024; 
ports(allocated: storage/default-role):[31000-32000] (total: cpus:2; mem:1024; 
disk:1024; ports:[31000-32000]; disk(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v0_sSnqSg/2GB-8c3d0180-f55f-4527-af97-b8a1990fabdd,test)]:2048,
 allocated: {}) on agent 83a2b4de-3aa8-40cf-a9bf-5c17535ba026-S0 from framework 
83a2b4de-3aa8-40
 cf-a9bf-5c17535ba026-
I0813 04:34:17.255329 18632 hierarchical.cpp:1264] Framework 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026- filtered agent 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026-S0 for 5secs
I0813 04:34:17.257719 18628 master.cpp:12597] Sending operation '' (uuid: 
37ad7baf-b1a0-4f60-b02a-06f5de64ca0d) to agent 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026-S0 at slave(1222)@172.17.0.2:39253 
(21cdb951747c)
I0813 04:34:17.258285 18625 slave.cpp:4346] Ignoring new checkpointed resources 
and operations identical to the current version
I0813 04:34:17.261219 18631 provider.cpp:481] Received APPLY_OPERATION event
I0813 04:34:17.261263 18631 provider.cpp:1295] Received UNRESERVE operation '' 
(uuid: 37ad7baf-b1a0-4f60-b02a-06f5de64ca0d)
I0813 04:34:17.270252 18633 hierarchical.cpp:1508] Performed allocation for 1 
agents in 1.193565ms
I0813 04:34:17.271003 18626 master.cpp:10414] Sending offers [ 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026-O6 ] to framework 
83a2b4de-3aa8-40cf-a9bf-5c17535ba026- (default) at 
scheduler-a9b0dcf2-839a-45d9-994e-b507dd594a86@172.17.0.2:39253
I0813 04:34:17.271628 18634 sched.cpp:934] Scheduler::resourceOffers took 
88629ns
I0813 04:34:17.284301 18640 http.cpp:1115] HTTP POST for 
/slave(1222)/api/v1/resource_provider from 172.17.0.2:57272
I0813 04:34:17.285513 18628 slave.cpp:8417] Handling resource provider message 
'UPDATE_OPERATION_STATUS: (uuid: b211a4e0-d0da-458e-acf5-064b21305ef2) for 
framework  (latest state: OPERATION_FINISHED, status update state: 
OPERATION_FINISHED)'
I0813 04:34:17.285693 18628 

Review Request 71275: Fixed agent draining when returning from unreachable state.

2019-08-12 Thread Joseph Wu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/71275/
---

Review request for mesos, Benjamin Bannier, Benno Evers, and Greg Mann.


Bugs: MESOS-9934
https://issues.apache.org/jira/browse/MESOS-9934


Repository: mesos


Description
---

This logic was missing from the initial implementation of agent
draining.  When an agent became unreachable, and then reregistered
with the master, the master would not properly deactivate or drain
the agent.


Diffs
-

  src/master/master.cpp 31c7c97abecb92591369b417e2ef38a99cc09bac 
  src/tests/api_tests.cpp c2099674e742eaa08134c5e0a7cdab1734808119 


Diff: https://reviews.apache.org/r/71275/diff/1/


Testing
---

make check


Thanks,

Joseph Wu