Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-12 Thread Timothy Chen


 On May 11, 2015, 11:07 p.m., Timothy Chen wrote:
  Thanks Jay, the changes look reasonable to me; I'll wait for Jie to comment.
 
 Jay Buffington wrote:
 Great, thanks!  When you commit, can you apply the three separate commits 
 that are here: 
 https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
 
 rather than just one big commit?

Sure, but can you submit separate reviews next time? It's hard to check 
that what you have on GitHub is exactly what you put in the ReviewBoard, and 
bugs might just slip in.


- Timothy


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83310
---






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83265
---



src/slave/containerizer/docker.cpp
https://reviews.apache.org/r/33249/#comment134171

We don't enqueue destroy anymore; let's fix this comment.



src/tests/slave_tests.cpp
https://reviews.apache.org/r/33249/#comment134173

I don't really understand this comment.  And if you simply pause the clock, 
your await won't work, right?


- Timothy Chen






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington


 On May 11, 2015, 12:12 p.m., Timothy Chen wrote:
  src/tests/slave_tests.cpp, line 325
  https://reviews.apache.org/r/33249/diff/7/?file=955468#file955468line325
 
  I don't really understand this comment.  And if you simply pause the 
  clock, your await won't work, right?

I'm starting to question the legitimacy of this test anyway.  What it really 
ends up testing is the TestContainerizer, and that makes no sense.  So I'm just 
going to remove the whole thing.

Also, it currently doesn't work (see ReviewBot's previous post) because 
EXPECT_CALL seems to cause _launch to never get invoked.
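
For the record, here is a minimal, self-contained sketch of that gmock pitfall 
(toy classes, not the actual Mesos mocks): a bare EXPECT_CALL installs a 
default action that shadows the real method body, so _launch is only reached 
if the expectation explicitly forwards to it.

    #include <string>

    #include <gmock/gmock.h>
    #include <gtest/gtest.h>

    // Toy stand-ins, not the actual Mesos containerizer mocks.
    class Containerizer
    {
    public:
      virtual ~Containerizer() {}

      // The public entry point forwards to _launch, mirroring the
      // launch()/_launch() split used in the Mesos test mocks.
      virtual bool launch(const std::string& containerId)
      {
        return _launch(containerId);
      }

      virtual bool _launch(const std::string& containerId) { return true; }
    };

    class MockContainerizer : public Containerizer
    {
    public:
      MOCK_METHOD(bool, launch, (const std::string&), (override));
    };

    TEST(LaunchTest, ExpectCallShadowsLaunch)
    {
      MockContainerizer containerizer;

      // A bare EXPECT_CALL leaves gmock's default action in place (return
      // false) and _launch never runs.  Forwarding explicitly restores the
      // real behavior:
      EXPECT_CALL(containerizer, launch(testing::_))
        .WillOnce(testing::Invoke(&containerizer, &MockContainerizer::_launch));

      EXPECT_TRUE(containerizer.launch("container-1"));
    }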


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83265
---






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 11, 2015, 3:06 p.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Description
---

Send statusUpdate to scheduler on containerizer launch failure

This review actually covers three commits, which are separated out 
on GitHub here:
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

Commit #1:
docker-no-executor-framework example should send ContainerInfo

docker-no-executor-framework stopped working when the protobuf message
changed during development.  Apparently no one noticed, because we don't
run it as part of our test suite.  I needed it to develop a fix for
proper error reporting, so I fixed it.
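
For context, the gist of the fix is attaching a ContainerInfo of type DOCKER 
to each TaskInfo.  A rough sketch from memory of the protobuf API of that era 
(setDockerInfo is a made-up helper, not the actual diff):

    #include <string>

    #include <mesos/mesos.pb.h>

    // Illustrative only: give a task the DOCKER ContainerInfo that the
    // docker-no-executor-framework example was missing.
    void setDockerInfo(mesos::TaskInfo* task, const std::string& image)
    {
      mesos::ContainerInfo* container = task->mutable_container();
      container->set_type(mesos::ContainerInfo::DOCKER);
      container->mutable_docker()->set_image(image);
    }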

Commit #2:
serialize wait and destroy in containerizer

Containerizers keep track of all the containers they have created.  When
a container is destroy()'ed the containerizer removes the container from
its internal hashmap.  This means that wait() must be called before the
container is destroyed.

When the docker containerizer fails because of a docker pull or
docker run failure, we end up with a race condition where wait() does
not properly return the Termination, so we end up with the default
failure message of "Abnormal Executor Termination", which isn't very
useful.

This commit solves this issue by not destroying the container until
after wait is called in the failure case.

Fixes MESOS-2656
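
To make the ordering concrete, here is a self-contained toy model of the race 
(std::future in place of libprocess futures; ToyContainerizer and all other 
names are invented, not the actual containerizer code).  destroy() both 
delivers the Termination and erases the hashmap entry, so a wait() issued 
afterwards can only see the generic message; taking the wait() future first 
preserves the real reason:

    #include <future>
    #include <iostream>
    #include <map>
    #include <string>

    // Toy containerizer: tracks containers in a hashmap, as described above.
    class ToyContainerizer
    {
      std::map<std::string, std::promise<std::string>> containers;

    public:
      void launch(const std::string& id) { containers[id]; }

      // Resolves with the Termination reason once the container is destroyed.
      std::future<std::string> wait(const std::string& id)
      {
        auto it = containers.find(id);
        if (it == containers.end()) {
          // Container already erased: only the generic message is left.
          std::promise<std::string> unknown;
          unknown.set_value("Abnormal Executor Termination");
          return unknown.get_future();
        }
        return it->second.get_future();
      }

      void destroy(const std::string& id, const std::string& reason)
      {
        auto it = containers.find(id);
        if (it != containers.end()) {
          it->second.set_value(reason);  // Deliver the real failure reason...
          containers.erase(it);          // ...then drop the container.
        }
      }
    };

    int main()
    {
      ToyContainerizer containerizer;
      containerizer.launch("c1");

      // The fix: obtain the wait() future *before* destroying, so the real
      // error survives.  Destroying first would lose it.
      std::future<std::string> termination = containerizer.wait("c1");
      containerizer.destroy("c1", "Failed to 'docker pull bogus-image:latest'");

      std::cout << termination.get() << std::endl;
    }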

Commit #3:
send scheduler containerizer launch failure errors

If a method in the launch sequence returned a failure, the error message
wasn't making it back to the scheduler.

Fixes MESOS-2020
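
A minimal sketch of that idea in plain C++ (StatusUpdate and the callbacks are 
invented for illustration, not the slave's actual code path): any failure from 
a launch step is wrapped into a TASK_FAILED status update instead of being 
dropped:

    #include <functional>
    #include <iostream>
    #include <stdexcept>
    #include <string>

    // Invented stand-in for the scheduler-bound status update.
    struct StatusUpdate
    {
      std::string state;
      std::string message;
    };

    // If any step of the launch sequence fails, forward the error message
    // to the scheduler instead of swallowing it.
    void launch(
        const std::function<void()>& step,
        const std::function<void(const StatusUpdate&)>& sendStatusUpdate)
    {
      try {
        step();
      } catch (const std::exception& e) {
        sendStatusUpdate({
            "TASK_FAILED",
            std::string("Failed to launch container: ") + e.what()});
      }
    }

    int main()
    {
      launch(
          []() {
            throw std::runtime_error("Failed to 'docker pull bogus-image:latest'");
          },
          [](const StatusUpdate& update) {
            std::cout << update.state << ": " << update.message << std::endl;
          });
    }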


Diffs (updated)
---

  include/mesos/containerizer/containerizer.proto 
95c84dfbecef28ca9a220a98293c5fc6215a692f 
  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
  src/tests/docker_containerizer_tests.cpp 
c9d66b3fbc7d081f36c26781573dca50de823c44 

Diff: https://reviews.apache.org/r/33249/diff/


Testing
---

I used examples/docker_no_executor_framework.cpp, wrote a few tests, and ran 
make check.


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington


 On May 11, 2015, 4:07 p.m., Timothy Chen wrote:
  Thanks Jay, the changes look reasonable to me; I'll wait for Jie to comment.

Great, thanks!  When you commit, can you apply the three separate commits that are 
here: 
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

rather than just one big commit?


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83310
---






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83319
---

Ship it!


Nice tests! LGTM!

- Jie Yu






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83310
---


Thanks Jay, the changes look reasonable to me; I'll wait for Jie to comment.

- Timothy Chen






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83308
---

Ship it!


Ship It!

- Timothy Chen






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83309
---


Patch looks great!

Reviews applied: [33249]

All tests passed.

- Mesos ReviewBot






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington


 On May 8, 2015, 11:48 p.m., Timothy Chen wrote:
  src/slave/containerizer/mesos/containerizer.cpp, line 567
  https://reviews.apache.org/r/33249/diff/6/?file=954330#file954330line567
 
  No tests for mesos containerizer change?

The changes to the mesos containerizer were to support the serialization of 
wait/destroy.  If the mesos containerizer fails to, say, fetch, the scheduler 
still gets back a poor "executor terminated" message.  This happens much more 
rarely than in the docker case, so I didn't address it in this patch.

We weren't testing that destroy was getting called when launch failed.  Adding 
a test now that checks that destroy does not get called on failure seems odd.  
I'm open to suggestions, though.

I did add my original slave_tests.cpp test back.


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83136
---






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83233
---



include/mesos/containerizer/containerizer.proto
https://reviews.apache.org/r/33249/#comment134137

Hmm, I was going to say it should end with a period, but since it's a type 
name, perhaps this is fine.



src/slave/containerizer/docker.cpp
https://reviews.apache.org/r/33249/#comment134138

Does this mean that when docker run fails we won't be able to get the error 
message from it?


- Timothy Chen






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 11, 2015, 10:10 a.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Changes
---

Added back the slave_tests.cpp test and addressed tnachen's two comments.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Description
---

Send statusUpdate to scheduler on containerizer launch failure

This review actually covers three commits, which are separated out 
on GitHub here:
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

Commit #1:
docker-no-executor-framework example should send ContainerInfo

docker-no-executor-framework stopped working when the protobuf message
changed during development.  Apparently no one noticed, because we don't
run it as part of our test suite.  I needed it to develop a fix for
proper error reporting, so I fixed it.

Commit #2:
serialize wait and destroy in containerizer

Containerizers keep track of all the containers they have created.  When
a container is destroy()'ed the containerizer removes the container from
its internal hashmap.  This means that wait() must be called before the
container is destroyed.

When the docker containerizer fails because of a docker pull or
docker run failure, we end up with a race condition where wait() does
not properly return the Termination, so we end up with the default
failure message of "Abnormal Executor Termination", which isn't very
useful.

This commit solves this issue by not destroying the container until
after wait is called in the failure case.

Fixes MESOS-2656

Commit #3:
send scheduler containerizer launch failure errors

If a method in the launch sequence returned a failure, the error message
wasn't making it back to the scheduler.

Fixes MESOS-2020


Diffs (updated)
---

  include/mesos/containerizer/containerizer.proto 
95c84dfbecef28ca9a220a98293c5fc6215a692f 
  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
  src/tests/docker_containerizer_tests.cpp 
c9d66b3fbc7d081f36c26781573dca50de823c44 
  src/tests/slave_tests.cpp 04e79ec061e767a82d40242be7884967942c50b6 

Diff: https://reviews.apache.org/r/33249/diff/


Testing
---

I used examples/docker_no_executor_framework.cpp, wrote a few tests, and ran 
make check.


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82883
---


Patch looks great!

Reviews applied: [33249]

All tests passed.

- Mesos ReviewBot






Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jay Buffington


 On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
  src/examples/docker_no_executor_framework.cpp, line 109
  https://reviews.apache.org/r/33249/diff/4/?file=952333#file952333line109
 
  Is this something you'd like to commit?

Whoops!  That snuck in.  I'll fix it.


 On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
  src/slave/containerizer/docker.cpp, line 1187
  https://reviews.apache.org/r/33249/diff/4/?file=952336#file952336line1187
 
  This no longer seems to respect the explicit state we use to track what 
  stage the containerizer was running in when we destroy, and at what stage 
  it failed.
  
  Does the error message clearly say whether it failed to fetch/pull/run?

Tracking the stage the container is in when we destroy it is done by the 
container-state enum now, right?

Yes, the status message on TASK_FAILED that the scheduler gets back now looks 
like this:

Failed to launch container: Failed to 'docker pull bogus-image:latest': 
exit status = exited with status 1 stderr = time=2015-05-07T17:45:35Z 
level=fatal msg=Error: image library/bogus-image:latest not found

I tested fetch failures by providing a bogus URI, and run failures by starting 
the slave with --docker=/tmp/docker, which looked like this:

$ cat /tmp/docker
#!/bin/bash

# Fail any "docker run" invocation; pass everything else through to docker.
if [[ "$1" == "run" ]]; then
    echo "docker run failure!" 1>&2
    exit 123
fi

exec docker "$@"


 On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
  src/slave/slave.cpp, line 3125
  https://reviews.apache.org/r/33249/diff/4/?file=952339#file952339line3125
 
  We shouldn't need to destroy if none of the containerizers return true.

Ah, that's right.  I will just remove this line.


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82832
---


 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 7, 2015, 11:57 a.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Description
---

Send statusUpdate to scheduler on containerizer launch failure

This review actually covers three commits, which are separated out
on GitHub here:
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

Commit #1:
docker-no-executor-framework example should send ContainerInfo

docker-no-executor-framework stopped working when the protobuf message
changed during development.  Apparently no one noticed, because we don't
run it as part of our test suite.  I needed it to develop a fix for
proper error reporting, so I fixed it.

Commit #2:
serialize wait and destroy in containerizer

Containerizers keep track of all the containers they have created.  When
a container is destroy()'ed the containerizer removes the container from
its internal hashmap.  This means that wait() must be called before the
container is destroyed.

When the docker containerizer fails because of a docker pull or
docker run failure, we end up with a race condition where wait() does
not properly return the Termination, so we end up with the default
failure message of "Abnormal Executor Termination", which isn't very
useful.

This commit solves this issue by not destroying the container until
after wait is called in the failure case.

Fixes MESOS-2656

Commit #3:
send scheduler containerizer launch failure errors

If a method in the launch sequence returned a failure, the error message
wasn't making it back to the scheduler.

Fixes MESOS-2020


Diffs (updated)
---

  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 

Diff: https://reviews.apache.org/r/33249/diff/


Testing
---

I used examples/docker_no_executor_framework.cpp and ran make check.


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 7, 2015, 9:17 a.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Description (updated)
---

Send statusUpdate to scheduler on containerizer launch failure

This review actually covers three commits, which are separated out
on GitHub here:
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

Commit #1:
docker-no-executor-framework example should send ContainerInfo

docker-no-executor-framework stopped working when the protobuf message
changed during development.  Apparently no one noticed, because we don't
run it as part of our test suite.  I needed it to develop a fix for
proper error reporting, so I fixed it.

Commit #2:
serialize wait and destroy in containerizer

Containerizers keep track of all the containers they have created.  When
a container is destroy()'ed the containerizer removes the container from
its internal hashmap.  This means that wait() must be called before the
container is destroyed.

When the docker containerizer fails because of a docker pull or
docker run failure, we end up with a race condition where wait() does
not properly return the Termination, so we end up with the default
failure message of "Abnormal Executor Termination", which isn't very
useful.

This commit solves this issue by not destroying the container until
after wait is called in the failure case.

Fixes MESOS-2656

Commit #3:
send scheduler containerizer launch failure errors

If a method in the launch sequence returned a failure, the error message
wasn't making it back to the scheduler.

Fixes MESOS-2020


Diffs (updated)
---

  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 

Diff: https://reviews.apache.org/r/33249/diff/


Testing (updated)
---

I used examples/docker_no_executor_framework.cpp and ran make check.


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82833
---


Patch looks great!

Reviews applied: [33249]

All tests passed.

- Mesos ReviewBot


On May 7, 2015, 4:17 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated May 7, 2015, 4:17 p.m.)
 
 
 Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
 
 
 Bugs: MESOS-2020 and MESOS-2656
 https://issues.apache.org/jira/browse/MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2656
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Send statusUpdate to scheduler on containerizer launch failure
 
 This review is actually for three commits.  The three commits are separated 
 out
 on github here:
 https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
 
 Commit #1:
 docker-no-executor-framework example should send ContainerInfo
 
 docker-no-executor-framework stopped working when the protobuf message
 changed during development.  Apparently noone noticed because we don't
 run it as part of our test suite.  I used it to develop a fix for
 proper error reporting, so I fixed it.
 
 Commit #2:
 serialize wait and destroy in containerizer
 
 Containerizers keep track of all the containers they have created.  When
 a container is destroy()'ed the containerizer removes the container from
 its internal hashmap.  This means that wait() must be called before the
 container is destroyed.
 
 When the docker containerizer fails because of a docker pull or
 docker run failure, we end up with a race condition where wait() does
 not properly return the Termination, so we ended up with a default
 failure message of Abnormal Executor Termination which isn't very
 useful.
 
 This commit solves this issue by not destroying the container until
 after wait is called in the failure case.
 
 Fixes MESOS-2656
 
 Commit #3:
 send scheduler containerizer launch failure errors
 
 If a method in the launch sequence returns a failure, the error message
 wasn't making it back to the scheduler.
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   src/examples/docker_no_executor_framework.cpp 
 d5385d9295d30a6039bab9938f3270006582d3da 
   src/slave/containerizer/containerizer.hpp 
 56c088abb036aa26c323c3142fd27ea29ed51d4f 
   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
   src/slave/containerizer/mesos/containerizer.hpp 
 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
   src/slave/containerizer/mesos/containerizer.cpp 
 f2587280dc0e1d566d2b856a80358c7b3896c603 
   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I used examples/docker_no_executor_framework.cpp and make check
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82832
---


Do we have any tests that cover the changes in the mesos containerizer and the
docker containerizer?


src/examples/docker_no_executor_framework.cpp
https://reviews.apache.org/r/33249/#comment133671

is this something you'd like to commit?



src/slave/containerizer/docker.cpp
https://reviews.apache.org/r/33249/#comment133673

This no longer seems to respect the explicit state we track, i.e. what stage 
the containerizer was in when we destroy, and at what stage it failed.

Does the error message clearly say whether it failed to fetch/pull/run?



src/slave/slave.cpp
https://reviews.apache.org/r/33249/#comment133672

We shouldn't need to destroy if none of the containerizers return true.


- Timothy Chen


On May 7, 2015, 4:17 p.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated May 7, 2015, 4:17 p.m.)
 
 
 Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
 
 
 Bugs: MESOS-2020 and MESOS-2656
 https://issues.apache.org/jira/browse/MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2656
 
 
 Repository: mesos
 
 
 Description
 ---
 
 Send statusUpdate to scheduler on containerizer launch failure
 
 This review is actually for three commits.  The three commits are separated 
 out
 on github here:
 https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
 
 Commit #1:
 docker-no-executor-framework example should send ContainerInfo
 
 docker-no-executor-framework stopped working when the protobuf message
 changed during development.  Apparently no one noticed because we don't
 run it as part of our test suite.  I used it to develop a fix for
 proper error reporting, so I fixed it.
 
 Commit #2:
 serialize wait and destroy in containerizer
 
 Containerizers keep track of all the containers they have created.  When
 a container is destroy()'ed the containerizer removes the container from
 its internal hashmap.  This means that wait() must be called before the
 container is destroyed.
 
 When the docker containerizer fails because of a "docker pull" or
 "docker run" failure, we end up with a race condition where wait() does
 not properly return the Termination, so we end up with the default
 failure message of "Abnormal Executor Termination", which isn't very
 useful.
 
 This commit solves this issue by not destroying the container until
 after wait is called in the failure case.
 
 Fixes MESOS-2656
 
 Commit #3:
 send scheduler containerizer launch failure errors
 
 If a method in the launch sequence returns a failure, the error message
 wasn't making it back to the scheduler.
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   src/examples/docker_no_executor_framework.cpp 
 d5385d9295d30a6039bab9938f3270006582d3da 
   src/slave/containerizer/containerizer.hpp 
 56c088abb036aa26c323c3142fd27ea29ed51d4f 
   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
   src/slave/containerizer/mesos/containerizer.hpp 
 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
   src/slave/containerizer/mesos/containerizer.cpp 
 f2587280dc0e1d566d2b856a80358c7b3896c603 
   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I used examples/docker_no_executor_framework.cpp and make check
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-22 Thread Jay Buffington


 On April 21, 2015, 4:25 p.m., Jie Yu wrote:
  src/slave/slave.cpp, lines 3065-3078
  https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065
 
  Instead of doing that in your way, can we just try to make sure 
  `containerizer->wait` here will return a failure (or a Termination with 
  some reason) when `containerizer->launch` fails. In that way, the 
  `executorTerminated` will properly send status updates to the slave 
  (TASK_LOST/TASK_FAILED).
  
  Or am I missing something?
 
 Jie Yu wrote:
 OK, I think I got confused by the ticket. There are actually two problems 
 here. The problem I am referring to is the fact that we don't send status 
 update to the scheduler if containerizer launch fails until executor 
 reregistration timeout happens. Since for docker containerizer, someone might 
 use a very large timeout value, ideally, the slave should send a status 
 update to the scheduler right after containerizer launch fails.
 
 After chatting with Jay, the problem you guys are referring to is the fact 
 that the scheduler cannot distinguish between the case where the task has 
 failed vs. the case where the configuration of a task is not correct, because 
 in both cases, the scheduler will receive a TASK_FAILED/TASK_LOST.
 
 Jie Yu wrote:
 To address the first problem, I think the simplest way is to add a 
 containerizer->destroy(..) in executorLaunched when containerizer->launch 
 fails. In that way, it's going to trigger containerizer->wait and thus send 
 status update to the scheduler.
 
 Jie Yu wrote:
 Regarding the second problem, IMO, we should include a reason field in 
 Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
 sendExecutorTerminatedStatusUpdate propagate the termination reason to the 
 scheduler.
 
 Timothy Chen wrote:
 Reason field sounds good. I think what you proposed makes sense; in the 
 docker containerizer, at least, we also need to make sure the termination 
 message is set correctly, as currently it doesn't contain all the error 
 information that we pass back to the launch future.

Sorry for the confusion.   There are three problems that are all related.  Yes, 
we need to send statusUpdates as soon as containerizer launch fails and yes, we 
need to set the reason so the scheduler can distinguish a slave failure from a 
bad request.  However, my intention with this patch is not to address either of 
those two problems.

My goal with this patch is to simply send the containerizer launch failure 
message back to the scheduler.  I am using Aurora with the docker 
containerizer.  There are a myriad of reasons that dockerd could fail to 
"docker run" a container.  Currently, when that fails, the user sees a useless 
and incorrect "Abnormal Executor Termination" message in the Aurora web UI.  
With this patch they see the stderr output of "docker run".  This way they can 
report meaningful error messages to the operations team.

I can update this patch to address the other two issues, but the key is that 
when the containerizer launch fails we have to send a statusUpdate with a 
message that contains future.failure().  Before this patch we were only logging 
it.  The scheduler needs to get that error message.


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
---


On April 21, 2015, 10:14 a.m., Jay Buffington wrote:
 
 ---
 This is an automatically generated e-mail. To reply, visit:
 https://reviews.apache.org/r/33249/
 ---
 
 (Updated April 21, 2015, 10:14 a.m.)
 
 
 Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
 
 
 Bugs: MESOS-2020
 https://issues.apache.org/jira/browse/MESOS-2020
 
 
 Repository: mesos
 
 
 Description
 ---
 
 When mesos is unable to launch the containerizer, the scheduler should
 get a TASK_FAILED with a status message that includes the error the
 containerizer encountered when trying to launch.
 
 Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
 
 Fixes MESOS-2020
 
 
 Diffs
 -
 
   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
 
 Diff: https://reviews.apache.org/r/33249/diff/
 
 
 Testing
 ---
 
 I added a test case to slave_tests.cpp.  I also tried this with Aurora: I 
 supplied a bogus docker image URL and saw the "docker pull" failure stderr 
 message in Aurora's web UI.
 
 
 Thanks,
 
 Jay Buffington
 




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-22 Thread Jie Yu


 On April 21, 2015, 11:25 p.m., Jie Yu wrote:
  src/slave/slave.cpp, lines 3065-3078
  https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065
 
  Instead of doing that in your way, can we just try to make sure 
  `containerizer->wait` here will return a failure (or a Termination with 
  some reason) when `containerizer->launch` fails. In that way, the 
  `executorTerminated` will properly send status updates to the slave 
  (TASK_LOST/TASK_FAILED).
  
  Or am I missing something?
 
 Jie Yu wrote:
 OK, I think I got confused by the ticket. There are actually two problems 
 here. The problem I am referring to is the fact that we don't send status 
 update to the scheduler if containerizer launch fails until executor 
 reregistration timeout happens. Since for docker containerizer, someone might 
 use a very large timeout value, ideally, the slave should send a status 
 update to the scheduler right after containerizer launch fails.
 
 After chatting with Jay, the problem you guys are referring to is the fact 
 that the scheduler cannot distinguish between the case where the task has 
 failed vs. the case where the configuration of a task is not correct, because 
 in both cases, the scheduler will receive a TASK_FAILED/TASK_LOST.
 
 Jie Yu wrote:
 To address the first problem, I think the simplest way is to add a 
 containerizer->destroy(..) in executorLaunched when containerizer->launch 
 fails. In that way, it's going to trigger containerizer->wait and thus send 
 status update to the scheduler.
 
 Jie Yu wrote:
 Regarding the second problem, IMO, we should include a reason field in 
 Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
 sendExecutorTerminatedStatusUpdate propagate the termination reason to the 
 scheduler.
 
 Timothy Chen wrote:
 Reason field sounds good. I think what you proposed makes sense; in the 
 docker containerizer, at least, we also need to make sure the termination 
 message is set correctly, as currently it doesn't contain all the error 
 information that we pass back to the launch future.
 
 Jay Buffington wrote:
 Sorry for the confusion.   There are three problems that are all related. 
  Yes, we need to send statusUpdates as soon as containerizer launch fails and 
 yes, we need to set the reason so the scheduler can distinguish a slave 
 failure from a bad request.  However, my intention with this patch is not to 
 address either of those two problems.
 
 My goal with this patch is to simply send the containerizer launch 
 failure message back to the scheduler.  I am using Aurora with the docker 
 containerizer.  There are a myriad of reasons that dockerd could fail to 
 "docker run" a container.  Currently, when that fails, the user sees a useless 
 and incorrect "Abnormal Executor Termination" message in the Aurora web UI.  
 With this patch they see the stderr output of "docker run".  This way they 
 can report meaningful error messages to the operations team.
 
 I can update this patch to address the other two issues, but the key is 
 that when the containerizer launch fails we have to send a statusUpdate with 
 a message that contains future.failure().  Before this patch we were only 
 logging it.  The scheduler needs to get that error message.

Thanks for clarifying it, Jay! In fact, my proposal above should be able to 
solve the third problem cleanly. Check out 
`Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should properly 
set the message and reason fields in the Termination protobuf (basically why 
the container gets terminated and what's the error message). The slave will 
forward the reason and message to the scheduler through status update.

I chatted with BenM about this yesterday, and there are a couple of notes I 
want to write down here.

1. We probably need multiple levels for TaskStatus::Reason. In other words, we 
probably need a `repeated Reason reasons` field in the status update message. 
For instance, for a containerizer launch failure, we probably need two reasons: 
1) the top level reason REASON_EXECUTOR_TERMINATED; 2) the second level reason 
REASON_DOCKER_PULL_FAILURE (see the sketch after these notes);

2. We probably want to allow extension to TaskStatus::Reason enum. For example, 
some framework/executor may want to add customized reasons. We could leverage 
the protobuf extension support to achieve that 
(https://developers.google.com/protocol-buffers/docs/proto#extensions).

3. The semantics around Termination are currently broken and we need to clean 
them up. For instance, the following code is broken:
```
void Slave::sendExecutorTerminatedStatusUpdate(...)
{
  ...
  if (termination.isReady() && termination.get().killed()) {
    taskState = TASK_FAILED;
    // TODO(dhamon): MESOS-2035: Add 'reason' to containerizer::Termination.
    reason = TaskStatus::REASON_MEMORY_LIMIT;
  }
}
```
because we now have disk limit as well.
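
As a rough sketch of note 1 from the scheduler's side (plain C++ stand-ins;
the real fields would live in the TaskStatus protobuf):

```
// Illustrative stand-ins only, not the actual Mesos protobuf types.
#include <iostream>
#include <string>
#include <vector>

enum class Reason {
  REASON_EXECUTOR_TERMINATED,  // top level reason
  REASON_DOCKER_PULL_FAILURE,  // second level, more specific reason
};

struct TaskStatus {
  std::string state;
  std::vector<Reason> reasons;  // the "repeated Reason reasons" idea
  std::string message;
};

int main() {
  TaskStatus status{
      "TASK_FAILED",
      {Reason::REASON_EXECUTOR_TERMINATED,
       Reason::REASON_DOCKER_PULL_FAILURE},
      "docker pull failed: repository not found"};

  std::cout << status.state << " (" << status.reasons.size()
            << " reasons): " << status.message << std::endl;
}
```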

Another issue about Termination is that containerizer->wait sometimes returns a 

Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-22 Thread Jay Buffington


 On April 21, 2015, 4:25 p.m., Jie Yu wrote:
  src/slave/slave.cpp, lines 3065-3078
  https://reviews.apache.org/r/33249/diff/3/?file=938221#file938221line3065
 
  Instead of doing that in your way, can we just try to make sure 
  `containerizer-wait` here will return a failure (or a Termination with 
  some reason) when `containerizer-launch` fails. In that way, the 
  `executorTerminated` will properly send status updates to the slave 
  (TASK_LOST/TASK_FAILED).
  
  Or am I missing something?
 
 Jie Yu wrote:
 OK, I think I got confused by the ticket. There are actually two problems 
 here. The problem I am refering to is the fact that we don't send status 
 update to the scheduler if containerizer launch fails until executor 
 reregistration timeout happens. Since for docker containerizer, someone might 
 use a very large timeout value, ideally, the slave should send a status 
 update to the scheduler right after containerizer launch fails.
 
 After chat with Jay, the problem you guys are refering to is the fact 
 that the scheduler cannot disinguish between the case where the task has 
 failed vs. the case where the configuration of a task is not correct, because 
 in both cases, the scheduler will receive a TASK_FAILED/TASK_LOST.
 
 Jie Yu wrote:
 To address the first problem, I think the simplest way is to add a 
 containerizer-destroy(..) in executorLaunched when containerizer-launch 
 fails. In that way, it's going to trigger containerizer-wait and thus send 
 status update to the scheduler.
 
 Jie Yu wrote:
 Regarding the second problem, IMO, we should include a reason field in 
 Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
 sendExecutorTerminatedStatusUpdate to propagate the termination reason to the 
 scheduler.
 
 Timothy Chen wrote:
 Reason field sounds good, I think what you proposed makes sense, in 
 docker containerizer at least we also need to make sure termination message 
 is set correctly as currently it doesn't contain all the error information 
 that we pass back to the launch future.
 
 Jay Buffington wrote:
 Sorry for the confusion. There are three problems that are all related. 
 Yes, we need to send status updates as soon as the containerizer launch fails, 
 and yes, we need to set the reason so the scheduler can distinguish a slave 
 failure from a bad request. However, my intention with this patch is not to 
 address either of those two problems.
 
 My goal with this patch is simply to send the containerizer launch 
 failure message back to the scheduler. I am using Aurora with the docker 
 containerizer. There are a myriad of reasons that dockerd could fail to 
 `docker run` a container. Currently, when that fails, the user sees a useless 
 and incorrect "Abnormal Executor Termination" message in the Aurora web UI. 
 With this patch they see the stderr output of `docker run`. This way they 
 can report meaningful error messages to the operations team.
 
 I can update this patch to address the other two issues, but the key is 
 that when the containerizer launch fails we have to send a status update whose 
 message contains `future.failure()`. Before this patch we were only 
 logging it. The scheduler needs to get that error message; a sketch of what 
 I mean follows.
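 
 (A sketch only; `protobuf::createStatusUpdate`'s exact signature varies 
 across Mesos versions, and the surrounding names are assumed.)
 ```
 // Hypothetical sketch: surface the launch failure to the scheduler
 // instead of only logging it.
 const std::string message =
   "Failed to launch container: " + future.failure();
 
 const StatusUpdate update = protobuf::createStatusUpdate(
     frameworkId,
     info.id(),                 // this slave's id
     taskId,
     TASK_FAILED,
     TaskStatus::SOURCE_SLAVE,  // assumed source value
     message);
 
 statusUpdate(update, UPID());  // feed into the slave's own update path
 ```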
 
 Jie Yu wrote:
 Thanks for clarifying it, Jay! In fact, my proposal above should be able 
 to solve the third problem cleanly. Check out 
 `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should 
 properly set the message and reason fields in the Termination protobuf 
 (basically, why the container was terminated and what the error message is). 
 The slave will forward the reason and message to the scheduler through a 
 status update.
 
 I chatted with BenM about this yesterday, and there are a couple of notes 
 I want to write down here.
 
 1. We probably need multiple levels for TaskStatus::Reason. In other 
 words, we probably need a repeated `Reason reasons` field in the status update 
 message. For instance, for a containerizer launch failure, we probably need 
 two reasons: 1) the top-level reason REASON_EXECUTOR_TERMINATED; 2) the 
 second-level reason REASON_DOCKER_PULL_FAILURE. A sketch follows.
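 
 (A hypothetical sketch; the real TaskStatus carries a single optional 
 reason, and the field number below is assumed.)
 ```
 // Hypothetical sketch: reasons ordered from most general to most
 // specific, e.g. [REASON_EXECUTOR_TERMINATED, REASON_DOCKER_PULL_FAILURE].
 message TaskStatus {
   // ... existing fields elided ...
   repeated Reason reasons = 30;  // assumed field number
 }
 ```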
 
 2. We probably want to allow extensions to the TaskStatus::Reason enum. For 
 example, some frameworks or executors may want to add customized reasons. We 
 could leverage protobuf's extension support to achieve that 
 (https://developers.google.com/protocol-buffers/docs/proto#extensions).
 
 3. The semantics around Termination are currently broken and we need to 
 clean them up. For instance, the following code is broken:
 ```
 void Slave::sendExecutorTerminatedStatusUpdate(...)
 {
   ...
   if (termination.isReady() && termination.get().killed()) {
     taskState = TASK_FAILED;
     // TODO(dhamon): MESOS-2035: Add 'reason' to containerizer::Termination.
     reason = TaskStatus::REASON_MEMORY_LIMIT;
   }
 }
 ```
 because we now have a disk limit as well.