Re: Review Request 72029: Changed termination logic of the default executor.

2020-01-30 Thread Greg Mann


> On Jan. 30, 2020, 12:14 a.m., Greg Mann wrote:
> > The patch looks great, thanks Andrei. What about adding a test for this, 
> > would it be hard? I'm imagining something like:
> > 1) kill a task under the default executor
> > 2) intercept the ACK from agent to executor
> > 3) verify that the executor is still running
> > 4) send the ACK to the executor
> > 5) verify that the executor has terminated
> > 
> > WDYT?
> 
> Andrei Budnik wrote:
> How to implement step (2) and step (4)? Is there an example somewhere in 
> Mesos tests?
> 
> Greg Mann wrote:
> Yep there are some places in the tests where we use `DROP_PROTOBUF` to 
> intercept a message and then inject it manually with `process::post`; see 
> 'TaskStatusUpdateManagerTest.DuplicateUpdateBeforeAck', for example.

Oh shoot, for an HTTP executor that won't work though... hmm I'm not sure if 
there's a good way to do this in our test code right now.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219426
---


On Jan. 30, 2020, 3:28 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72029/
> ---
> 
> (Updated Jan. 30, 2020, 3:28 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-8537
> https://issues.apache.org/jira/browse/MESOS-8537
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the default executor terminated itself after all containers
> had terminated. This could lead to termination of the executor before
> processing of a terminal status update by the agent. In order
> to mitigate this issue, the executor slept for one second to give a
> chance to send all status updates and receive all status update
> acknowledgements before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the default executor if all status updates have
> been acknowledged by the agent and no running containers left.
> Also, this patch increases the timeout from one second to one minute
> for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c 
> 
> 
> Diff: https://reviews.apache.org/r/72029/diff/4/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72029: Changed termination logic of the default executor.

2020-01-30 Thread Greg Mann


> On Jan. 30, 2020, 12:14 a.m., Greg Mann wrote:
> > The patch looks great, thanks Andrei. What about adding a test for this, 
> > would it be hard? I'm imagining something like:
> > 1) kill a task under the default executor
> > 2) intercept the ACK from agent to executor
> > 3) verify that the executor is still running
> > 4) send the ACK to the executor
> > 5) verify that the executor has terminated
> > 
> > WDYT?
> 
> Andrei Budnik wrote:
> How to implement step (2) and step (4)? Is there an example somewhere in 
> Mesos tests?

Yep there are some places in the tests where we use `DROP_PROTOBUF` to 
intercept a message and then inject it manually with `process::post`; see 
'TaskStatusUpdateManagerTest.DuplicateUpdateBeforeAck', for example.


- Greg


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219426
---


On Jan. 30, 2020, 3:28 p.m., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72029/
> ---
> 
> (Updated Jan. 30, 2020, 3:28 p.m.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-8537
> https://issues.apache.org/jira/browse/MESOS-8537
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the default executor terminated itself after all containers
> had terminated. This could lead to termination of the executor before
> processing of a terminal status update by the agent. In order
> to mitigate this issue, the executor slept for one second to give a
> chance to send all status updates and receive all status update
> acknowledgements before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the default executor if all status updates have
> been acknowledged by the agent and no running containers left.
> Also, this patch increases the timeout from one second to one minute
> for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c 
> 
> 
> Diff: https://reviews.apache.org/r/72029/diff/4/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72068: Removed dead code from v1 API serialization changes.

2020-01-30 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72068/#review219443
---



Bad patch!

Reviews applied: [72056, 72064, 72065, 72066, 72067, 72068]

Failed command: ['bash', '-c', "set -o pipefail; export OS='ubuntu:16.04' 
BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose 
--disable-libtool-wrappers --disable-parallel-test-execution' 
ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/jenkins/buildbot.sh 2>&1 | 
tee build_72068"]

Error:
..
nt 971bbe49-9d06-44b0-bede-888f75d3b35d-S0
I0130 23:47:14.904318  3141 http_connection.hpp:131] Sending 
UPDATE_OPERATION_STATUS call to 
http://172.17.0.2:40819/slave(1249)/api/v1/resource_provider
I0130 23:47:14.905791  3150 process.cpp:3671] Handling HTTP event for process 
'slave(1249)' with path: '/slave(1249)/api/v1/resource_provider'
I0130 23:47:14.909590  3151 hierarchical.cpp:1853] Performed allocation for 1 
agents in 1.711672ms
I0130 23:47:14.910320  3145 master.cpp:9974] Sending offers [ 
971bbe49-9d06-44b0-bede-888f75d3b35d-O3 ] to framework 
971bbe49-9d06-44b0-bede-888f75d3b35d- (default) at 
scheduler-dc793204-e43a-4a41-b4a0-1ad12514a761@172.17.0.2:40819
I0130 23:47:14.911221  3142 sched.cpp:934] Scheduler::resourceOffers took 
102673ns
I0130 23:47:14.914903  3143 process.cpp:3671] Handling HTTP event for process 
'master' with path: '/master/api/v1'
I0130 23:47:14.916847  3149 http.cpp:1405] HTTP POST for /master/api/v1 from 
172.17.0.2:51758
I0130 23:47:14.917249  3149 http.cpp:277] Processing call CREATE_VOLUMES
I0130 23:47:14.918244  3149 master.cpp:3712] Authorizing principal 
'test-principal' to perform action CREATE_VOLUME on object 
{"value":"storage/default-role","resource":{"provider_id":{"value":"b2227d66-f5e1-4683-8e28-849b149e1446"},"name":"disk","type":"SCALAR","scalar":{"value":2048.0},"reservations":[{"type":"DYNAMIC","role":"storage"},{"type":"DYNAMIC","role":"storage/default-role","principal":"test-principal"}],"disk":{"persistence":{"id":"79b8b41d-9a30-4260-8026-840de47897a1","principal":"test-principal"},"volume":{"mode":"RW","container_path":"volume"},"source":{"type":"MOUNT","mount":{"root":"./csi/org.apache.mesos.csi.test/local/mounts"},"vendor":"org.apache.mesos.csi.test.local","id":"/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_bKXUj5/2GB-bf0b488b-d3f9-4c91-9ca1-017d27a4b283","profile":"test"
I0130 23:47:14.919992  3142 sched.cpp:960] Rescinded offer 
971bbe49-9d06-44b0-bede-888f75d3b35d-O3
I0130 23:47:14.920091  3142 sched.cpp:971] Scheduler::offerRescinded took 
28965ns
I0130 23:47:14.920533  3144 hierarchical.cpp:1576] Recovered ports(allocated: 
storage/default-role):[31000-32000]; disk(allocated: 
storage/default-role)(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_bKXUj5/2GB-bf0b488b-d3f9-4c91-9ca1-017d27a4b283,test)]:2048;
 cpus(allocated: storage/default-role):2; mem(allocated: 
storage/default-role):1024; disk(allocated: storage/default-role):1024 (total: 
cpus:2; mem:1024; disk:1024; ports:[31000-32000]; disk(reservations: 
[(DYNAMIC,storage),(DYNAMIC,storage/default-role,test-principal)])[MOUNT(org.apache.mesos.csi.test.local,/tmp/CSIVersion_StorageLocalResourceProviderTest_OperatorOperationsWithResourceProviderResources_v1_bKXUj5/2GB-bf0b488b-d3f9-4c91-9ca1-017d27a4b283,test)]:2048,
 offered or allocated: {}) on agent 971bbe49-9d06-44b0-bede-888f75d3b35d-S0 
from framework 971bb
 e49-9d06-44b0-bede-888f75d3b35d-
I0130 23:47:14.920750  3151 master.cpp:12273] Removing offer 
971bbe49-9d06-44b0-bede-888f75d3b35d-O3
I0130 23:47:14.921928  3144 hierarchical.cpp:1625] Framework 
971bbe49-9d06-44b0-bede-888f75d3b35d- filtered agent 
971bbe49-9d06-44b0-bede-888f75d3b35d-S0 for 5secs
I0130 23:47:14.924556  3148 master.cpp:12138] Sending operation '' (uuid: 
f182bc6f-30d0-4ffb-9056-488b9c0a9824) to agent 
971bbe49-9d06-44b0-bede-888f75d3b35d-S0 at slave(1249)@172.17.0.2:40819 
(11de8f60f6d8)
I0130 23:47:14.925266  3148 slave.cpp:4428] Ignoring new checkpointed resources 
and operations identical to the current version
I0130 23:47:14.928223  3143 provider.cpp:498] Received APPLY_OPERATION event
I0130 23:47:14.928273  3143 provider.cpp:1351] Received CREATE operation '' 
(uuid: f182bc6f-30d0-4ffb-9056-488b9c0a9824)
I0130 23:47:14.932937  3154 master.cpp:5958] Processing REVIVE call for 
framework 971bbe49-9d06-44b0-bede-888f75d3b35d- (default) at 
scheduler-dc793204-e43a-4a41-b4a0-1ad12514a761@172.17.0.2:40819
I0130 23:47:14.933447  3141 hierarchical.cpp:1721] Unsuppressed offers and 
cleared filters for roles { storage/default-role } of framework 
971bbe49-9d06-44b0-bede-888f75d3b35d-
I0130 23:47:14.935186  3141 

Review Request 72068: Removed dead code from v1 API serialization changes.

2020-01-30 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72068/
---

Review request for mesos, Andrei Sekretenko and Greg Mann.


Repository: mesos


Description
---

Per MESOS-10026, we replaced the old protobuf object construction
code with a direct serialization approach to improve performance.
This patch removes the old object construction code (I originally
kept it in case we needed it to debug, but at this point I'm more
confident that we don't need it).


Diffs
-

  src/master/http.cpp eeaac88c893b43170e655f8bff1d57dd0f7bbfb2 
  src/master/master.hpp c813e9fc855cfb1701ec32be7f690e06b6eb203f 
  src/slave/http.hpp 0afdad9479f0cc2c94452b6b1f2289dd6ea01494 
  src/slave/http.cpp 04ad0d816618a1880913857a6f0ff38c4643c488 


Diff: https://reviews.apache.org/r/72068/diff/1/


Testing
---

make check


Thanks,

Benjamin Mahler



Re: Review Request 72064: Improved performance of v1 agent operator API GET_FRAMEWORKS call.

2020-01-30 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72064/
---

(Updated Jan. 30, 2020, 9:45 p.m.)


Review request for mesos, Andrei Sekretenko and Greg Mann.


Changes
---

- Fixed an accidental compilation issue.


Bugs: MESOS-10026
https://issues.apache.org/jira/browse/MESOS-10026


Repository: mesos


Description
---

This follow the same approach used for the master's v1 calls:

https://github.com/apache/mesos/commit/4f4dab961bd45ca444d13b831cdb
https://github.com/apache/mesos/commit/3dda3622f5ed01e8c132dc5ca594

That is, serializing directly to protobuf or json from the in-memory
v0 state.


Diffs (updated)
-

  src/slave/http.hpp 0afdad9479f0cc2c94452b6b1f2289dd6ea01494 
  src/slave/http.cpp 04ad0d816618a1880913857a6f0ff38c4643c488 


Diff: https://reviews.apache.org/r/72064/diff/2/

Changes: https://reviews.apache.org/r/72064/diff/1-2/


Testing
---

make check


Thanks,

Benjamin Mahler



Re: Review Request 72056: Improved performance of v1 agent operator API GET_METRICS call.

2020-01-30 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72056/
---

(Updated Jan. 30, 2020, 9:31 p.m.)


Review request for mesos, Andrei Sekretenko and Greg Mann.


Changes
---

- Removed unnecessary mesos:: qualification.


Summary (updated)
-

Improved performance of v1 agent operator API GET_METRICS call.


Bugs: MESOS-10026
https://issues.apache.org/jira/browse/MESOS-10026


Repository: mesos


Description (updated)
---

This follow the same approach used for the master's GET_METRICS call:

https://github.com/apache/mesos/commit/469f2ebaf65b1642d1eb4a1df81a
https://github.com/apache/mesos/commit/3dda3622f5ed01e8c132dc5ca594

That is, serializing directly to protobuf or json from the in-memory
v0 state.


Diffs (updated)
-

  src/slave/http.cpp 04ad0d816618a1880913857a6f0ff38c4643c488 


Diff: https://reviews.apache.org/r/72056/diff/2/

Changes: https://reviews.apache.org/r/72056/diff/1-2/


Testing
---

make check


Thanks,

Benjamin Mahler



Review Request 72064: Improved performance of v1 agent operator API GET_FRAMEWORKS call.

2020-01-30 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72064/
---

Review request for mesos, Andrei Sekretenko and Greg Mann.


Bugs: MESOS-10026
https://issues.apache.org/jira/browse/MESOS-10026


Repository: mesos


Description
---

This follow the same approach used for the master's v1 calls:

https://github.com/apache/mesos/commit/4f4dab961bd45ca444d13b831cdb
https://github.com/apache/mesos/commit/3dda3622f5ed01e8c132dc5ca594

That is, serializing directly to protobuf or json from the in-memory
v0 state.


Diffs
-

  src/slave/http.hpp 0afdad9479f0cc2c94452b6b1f2289dd6ea01494 
  src/slave/http.cpp 04ad0d816618a1880913857a6f0ff38c4643c488 


Diff: https://reviews.apache.org/r/72064/diff/1/


Testing
---

make check


Thanks,

Benjamin Mahler



Review Request 72067: Improved performance of v1 agent operator API GET_STATE call.

2020-01-30 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72067/
---

Review request for mesos, Andrei Sekretenko and Greg Mann.


Bugs: MESOS-10026
https://issues.apache.org/jira/browse/MESOS-10026


Repository: mesos


Description
---

This follow the same approach used for the master's v1 calls:

https://github.com/apache/mesos/commit/1c60f0e4acbac96c34bd90e26515
https://github.com/apache/mesos/commit/3dda3622f5ed01e8c132dc5ca594

That is, serializing directly to protobuf or json from the in-memory
v0 state.


Diffs
-

  src/slave/http.hpp 0afdad9479f0cc2c94452b6b1f2289dd6ea01494 
  src/slave/http.cpp 04ad0d816618a1880913857a6f0ff38c4643c488 


Diff: https://reviews.apache.org/r/72067/diff/1/


Testing
---

make check


Thanks,

Benjamin Mahler



Review Request 72065: Improved performance of v1 agent operator API GET_EXECUTORS call.

2020-01-30 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72065/
---

Review request for mesos, Andrei Sekretenko and Greg Mann.


Bugs: MESOS-10026
https://issues.apache.org/jira/browse/MESOS-10026


Repository: mesos


Description
---

This follow the same approach used for the master's v1 calls:

https://github.com/apache/mesos/commit/6ab835459a452e53fec8982a5aaa
https://github.com/apache/mesos/commit/3dda3622f5ed01e8c132dc5ca594

That is, serializing directly to protobuf or json from the in-memory
v0 state.


Diffs
-

  src/slave/http.hpp 0afdad9479f0cc2c94452b6b1f2289dd6ea01494 
  src/slave/http.cpp 04ad0d816618a1880913857a6f0ff38c4643c488 


Diff: https://reviews.apache.org/r/72065/diff/1/


Testing
---

make check


Thanks,

Benjamin Mahler



Review Request 72066: Improved performance of v1 agent operator API GET_TASKS call.

2020-01-30 Thread Benjamin Mahler

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72066/
---

Review request for mesos, Andrei Sekretenko and Greg Mann.


Bugs: MESOS-10026
https://issues.apache.org/jira/browse/MESOS-10026


Repository: mesos


Description
---

This follow the same approach used for the master's v1 calls:

https://github.com/apache/mesos/commit/d7dd4d0e8493331d7b7a21b504eb
https://github.com/apache/mesos/commit/3dda3622f5ed01e8c132dc5ca594

That is, serializing directly to protobuf or json from the in-memory
v0 state.


Diffs
-

  src/slave/http.hpp 0afdad9479f0cc2c94452b6b1f2289dd6ea01494 
  src/slave/http.cpp 04ad0d816618a1880913857a6f0ff38c4643c488 


Diff: https://reviews.apache.org/r/72066/diff/1/


Testing
---

make check


Thanks,

Benjamin Mahler



Re: Review Request 72029: Changed termination logic of the default executor.

2020-01-30 Thread Mesos Reviewbot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219441
---



Bad patch!

Reviews applied: [72029]

Failed command: ['bash', '-c', "set -o pipefail; export OS='ubuntu:16.04' 
BUILDTOOL='autotools' COMPILER='gcc' CONFIGURATION='--verbose 
--disable-libtool-wrappers --disable-parallel-test-execution' 
ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1'; ./support/jenkins/buildbot.sh 2>&1 | 
tee build_72029"]

Error:
ubuntu-16.04: Pulling from mesos/mesos-build
18d680d61657: Pulling fs layer
0addb6fece63: Pulling fs layer
78e58219b215: Pulling fs layer
eb6959a66df2: Pulling fs layer
1105027a5560: Pulling fs layer
c36946130a38: Pulling fs layer
6a6a5e68faab: Pulling fs layer
80e07249924c: Pulling fs layer
c4c63e2501db: Pulling fs layer
668b207c2829: Pulling fs layer
ed76dddad568: Pulling fs layer
eb6959a66df2: Waiting
1105027a5560: Waiting
c36946130a38: Waiting
6a6a5e68faab: Waiting
668b207c2829: Waiting
80e07249924c: Waiting
ed76dddad568: Waiting
c4c63e2501db: Waiting
0addb6fece63: Download complete
78e58219b215: Verifying Checksum
78e58219b215: Download complete
eb6959a66df2: Verifying Checksum
eb6959a66df2: Download complete
18d680d61657: Verifying Checksum
18d680d61657: Download complete
6a6a5e68faab: Verifying Checksum
6a6a5e68faab: Download complete
80e07249924c: Verifying Checksum
80e07249924c: Download complete
c4c63e2501db: Verifying Checksum
c4c63e2501db: Download complete
c36946130a38: Verifying Checksum
c36946130a38: Download complete
18d680d61657: Pull complete
668b207c2829: Verifying Checksum
668b207c2829: Download complete
ed76dddad568: Verifying Checksum
ed76dddad568: Download complete
0addb6fece63: Pull complete
78e58219b215: Pull complete
eb6959a66df2: Pull complete
1105027a5560: Verifying Checksum
1105027a5560: Download complete
1105027a5560: Pull complete
c36946130a38: Pull complete
6a6a5e68faab: Pull complete
80e07249924c: Pull complete
c4c63e2501db: Pull complete
668b207c2829: Pull complete
ed76dddad568: Pull complete
Digest: sha256:fa967cbcfb44f55708a3cbc87f245c6d29dd891464db558af56a03ee321526bb
Status: Downloaded newer image for mesos/mesos-build:ubuntu-16.04
docker.io/mesos/mesos-build:ubuntu-16.04
Cloning into '/tmp/SRC'...
Note: checking out '5c6e2ec944e11ecb56b74d364199d642fd478dfb'.

You are in 'detached HEAD' state. You can look around, make experimental
changes and commit them, and you can discard any commits you make in this
state without impacting any branches by performing another checkout.

If you want to create a new branch to retain commits you create, you may
do so (now or later) by using -b with the checkout command again. Example:

  git checkout -b 

autoreconf: Entering directory `.'
autoreconf: configure.ac: not using Gettext
autoreconf: running: aclocal --warnings=all -I m4
autoreconf: configure.ac: tracing
configure.ac:1875: warning: cannot check for file existence when cross compiling
../../lib/autoconf/general.m4:2788: AC_CHECK_FILE is expanded from...
configure.ac:1875: the top level
configure.ac:1880: warning: cannot check for file existence when cross compiling
../../lib/autoconf/general.m4:2788: AC_CHECK_FILE is expanded from...
configure.ac:1880: the top level
configure.ac:2401: warning: AC_RUN_IFELSE called without default to allow cross 
compiling
../../lib/autoconf/general.m4:2759: AC_RUN_IFELSE is expanded from...
configure.ac:2401: the top level
autoreconf: running: libtoolize --copy
libtoolize: putting auxiliary files in '.'.
libtoolize: copying file './ltmain.sh'
libtoolize: putting macros in AC_CONFIG_MACRO_DIRS, 'm4'.
libtoolize: copying file 'm4/libtool.m4'
libtoolize: copying file 'm4/ltoptions.m4'
libtoolize: copying file 'm4/ltsugar.m4'
libtoolize: copying file 'm4/ltversion.m4'
libtoolize: copying file 'm4/lt~obsolete.m4'
autoreconf: running: /usr/bin/autoconf --warnings=all
configure.ac:1875: warning: cannot check for file existence when cross compiling
../../lib/autoconf/general.m4:2788: AC_CHECK_FILE is expanded from...
configure.ac:1875: the top level
configure.ac:1880: warning: cannot check for file existence when cross compiling
../../lib/autoconf/general.m4:2788: AC_CHECK_FILE is expanded from...
configure.ac:1880: the top level
configure.ac:2401: warning: AC_RUN_IFELSE called without default to allow cross 
compiling
../../lib/autoconf/general.m4:2759: AC_RUN_IFELSE is expanded from...
configure.ac:2401: the top level
autoreconf: configure.ac: not using Autoheader
autoreconf: running: automake --add-missing --copy --no-force --warnings=all
configure.ac:50: installing './ar-lib'
configure.ac:34: installing './compile'
configure.ac:24: installing './config.guess'
configure.ac:24: installing './config.sub'
configure.ac:46: installing './install-sh'
configure.ac:46: installing './missing'
3rdparty/Makefile.am:299: warning: source file '$(HTTP_PARSER)/http_parser.c' 
is in a 

Re: Review Request 72029: Changed termination logic of the default executor.

2020-01-30 Thread Andrei Budnik

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/
---

(Updated Янв. 30, 2020, 3:28 п.п.)


Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
Kone.


Bugs: MESOS-8537
https://issues.apache.org/jira/browse/MESOS-8537


Repository: mesos


Description
---

Previously, the default executor terminated itself after all containers
had terminated. This could lead to termination of the executor before
processing of a terminal status update by the agent. In order
to mitigate this issue, the executor slept for one second to give a
chance to send all status updates and receive all status update
acknowledgements before terminating itself. This might have led to
various race conditions in some circumstances (e.g., on a slow host).
This patch terminates the default executor if all status updates have
been acknowledged by the agent and no running containers left.
Also, this patch increases the timeout from one second to one minute
for fail-safety.


Diffs (updated)
-

  src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c 


Diff: https://reviews.apache.org/r/72029/diff/4/

Changes: https://reviews.apache.org/r/72029/diff/3-4/


Testing
---

internal CI


Thanks,

Andrei Budnik



Re: Review Request 72029: Changed termination logic of the default executor.

2020-01-30 Thread Andrei Budnik


> On Янв. 30, 2020, 12:14 д.п., Greg Mann wrote:
> > The patch looks great, thanks Andrei. What about adding a test for this, 
> > would it be hard? I'm imagining something like:
> > 1) kill a task under the default executor
> > 2) intercept the ACK from agent to executor
> > 3) verify that the executor is still running
> > 4) send the ACK to the executor
> > 5) verify that the executor has terminated
> > 
> > WDYT?

How to implement step (2) and step (4)? Is there an example somewhere in Mesos 
tests?


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72029/#review219426
---


On Янв. 29, 2020, 9:28 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72029/
> ---
> 
> (Updated Янв. 29, 2020, 9:28 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-8537
> https://issues.apache.org/jira/browse/MESOS-8537
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the default executor terminated itself after all containers
> had terminated. This could lead to termination of the executor before
> processing of a terminal status update by the agent. In order
> to mitigate this issue, the executor slept for one second to give a
> chance to send all status updates and receive all status update
> acknowledgements before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the default executor if all status updates have
> been acknowledged by the agent and no running containers left.
> Also, this patch increases the timeout from one second to one minute
> for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/launcher/default_executor.cpp 4369fd0052b2e8496ba63606fa57e17d881ea52c 
> 
> 
> Diff: https://reviews.apache.org/r/72029/diff/3/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-30 Thread Andrei Budnik


> On Янв. 29, 2020, 10:28 д.п., Qian Zhang wrote:
> > The commit message seems not accurate to me:
> > > This could lead to termination of the executor before receiving all 
> > > status update acknowledgments from the agent.
> > 
> > I think the issue that we wanted to mitigate is, executor may shutdown 
> > itself before the terminal status update (rather than the acks) is sent to 
> > agent.
> 
> Andrei Budnik wrote:
> Updated the description.
> 
> Qian Zhang wrote:
> > This could lead to termination of the executor before processing of a 
> terminal status update by the agent.
> 
> What do you mean for `before processing of a terminal status update by 
> the agent`? Executor processes terminal status update sent by the agent? I 
> think it should be `before the terminal status update is sent to the agent`.

In the case of MESOS-9847, a terminal status update acknowledgment was 
delivered to the agent, but the executor had been terminated before the agent 
processed the status update in the `Slave::statusUpdate`.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219410
---


On Янв. 29, 2020, 4:23 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Янв. 29, 2020, 4:23 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-30 Thread Andrei Budnik


> On Янв. 30, 2020, 2:31 д.п., Qian Zhang wrote:
> > src/docker/executor.cpp
> > Lines 779-788 (original), 779-789 (patched)
> > 
> >
> > This seems a bit redundant to me, I'd suggest to only have:
> > ```
> >   void _stop()
> >   {
> > driver.get()->stop();
> >   }
> > ```
> > 
> > And then call `delay(Seconds(60), self(), ::__stop);` in 
> > `_reaped()` and `launchTask()`.

I agree, but ideally we'd need to introduce a named constant for `Seconds(60)` 
(since it appears more than once in the code). I noticed that there is no such 
named constant for other executors.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219431
---


On Янв. 29, 2020, 4:23 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Янв. 29, 2020, 4:23 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>



Re: Review Request 72055: Changed termination logic of the Docker executor.

2020-01-30 Thread Andrei Budnik


> On Янв. 30, 2020, 12:18 д.п., Greg Mann wrote:
> > src/docker/executor.cpp
> > Lines 781-783 (original), 781-783 (patched)
> > 
> >
> > Is there a reason to have a failsafe here, but not in the default 
> > executor?

I added a fail-safe in the default executor too. Please see the latest version 
of `/r/72029/`.


> On Янв. 30, 2020, 12:18 д.п., Greg Mann wrote:
> > src/exec/exec.cpp
> > Lines 412-417 (patched)
> > 
> >
> > This seems like it's addressing a separate issue? If so, could we split 
> > it out into a separate patch?

The only reason I added this check is that we need to guard the condition 
statement below. Also, I doubt if this is an issue at all. So I think that it 
make sense to have them together in a single patch.


- Andrei


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/72055/#review219427
---


On Янв. 29, 2020, 4:23 п.п., Andrei Budnik wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/72055/
> ---
> 
> (Updated Янв. 29, 2020, 4:23 п.п.)
> 
> 
> Review request for mesos, Andrei Sekretenko, Greg Mann, Qian Zhang, and Vinod 
> Kone.
> 
> 
> Bugs: MESOS-9847
> https://issues.apache.org/jira/browse/MESOS-9847
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Previously, the Docker executor terminated itself after a task's
> container had terminated. This could lead to termination of the
> executor before processing of a terminal status update by the agent.
> In order to mitigate this issue, the executor slept for one second to
> give a chance to send all status updates and receive all status update
> acknowledgments before terminating itself. This might have led to
> various race conditions in some circumstances (e.g., on a slow host).
> This patch terminates the Docker executor after receiving a terminal
> status update acknowledgment. Also, this patch increases the timeout
> from one second to one minute for fail-safety.
> 
> 
> Diffs
> -
> 
>   src/docker/executor.cpp 132f42bfa42c846fc5dc40f7763aa0b5d12a7798 
>   src/exec/exec.cpp 69e5e24b248c7c913421de5e42713c34fd79ad46 
> 
> 
> Diff: https://reviews.apache.org/r/72055/diff/2/
> 
> 
> Testing
> ---
> 
> internal CI
> 
> 
> Thanks,
> 
> Andrei Budnik
> 
>