Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen


> On May 11, 2015, 11:07 p.m., Timothy Chen wrote:
> > Thanks Jay, the changes look reasonable to me; I'll wait for Jie to comment.
> 
> Jay Buffington wrote:
> Great, thanks!  When you commit can you add the three separate commits 
> that are here: 
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> rather than just one big commit?

Sure, but next time can you submit separate reviews? It's hard to verify that 
what you have here exactly matches what you put in ReviewBoard, and bugs might 
slip in.


- Timothy


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83310
---


On May 11, 2015, 10:06 p.m., Jay Buffington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> ---
> 
> (Updated May 11, 2015, 10:06 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020 and MESOS-2656
> https://issues.apache.org/jira/browse/MESOS-2020
> https://issues.apache.org/jira/browse/MESOS-2656
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Send statusUpdate to scheduler on containerizer launch failure
> 
> This review is actually for three commits.  The three commits are separated 
> out on github here:
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> Commit #1:
> docker-no-executor-framework example should send ContainerInfo
> 
> docker-no-executor-framework stopped working when the protobuf message
> changed during development.  Apparently no one noticed because we don't
> run it as part of our test suite.  I used it to develop a fix for
> proper error reporting, so I fixed it.
> 
> Commit #2:
> serialize wait and destroy in containerizer
> 
> Containerizers keep track of all the containers they have created.  When
> a container is destroy()'ed the containerizer removes the container from
> its internal hashmap.  This means that wait() must be called before the
> container is destroyed.
> 
> When the docker containerizer fails because of a "docker pull" or
> "docker run" failure, we end up with a race condition where wait() does
> not properly return the Termination, so we end up with the default
> failure message "Abnormal Executor Termination", which isn't very
> useful.
> 
> This commit solves this issue by not destroying the container until
> after wait is called in the failure case.
> 
> Fixes MESOS-2656
> 
> Commit #3:
> send scheduler containerizer launch failure errors
> 
> If a method in the launch sequence returned a failure, the error message
> wasn't making it back to the scheduler.
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -
> 
>   include/mesos/containerizer/containerizer.proto 
> 95c84dfbecef28ca9a220a98293c5fc6215a692f 
>   src/examples/docker_no_executor_framework.cpp 
> d5385d9295d30a6039bab9938f3270006582d3da 
>   src/slave/containerizer/containerizer.hpp 
> 56c088abb036aa26c323c3142fd27ea29ed51d4f 
>   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
>   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
>   src/slave/containerizer/mesos/containerizer.hpp 
> 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f2587280dc0e1d566d2b856a80358c7b3896c603 
>   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
>   src/tests/docker_containerizer_tests.cpp 
> c9d66b3fbc7d081f36c26781573dca50de823c44 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> ---
> 
> I used examples/docker_no_executor_framework.cpp, wrote a few tests, and 
> ran make check.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83319
---

Ship it!


Nice tests! LGTM!

- Jie Yu





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington


> On May 11, 2015, 4:07 p.m., Timothy Chen wrote:
> > Thanks Jay, the changes look reasonable to me; I'll wait for Jie to comment.

Great, thanks!  When you commit can you add the three separate commits that are 
here: 
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

rather than just one big commit?


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83310
---





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83309
---


Patch looks great!

Reviews applied: [33249]

All tests passed.

- Mesos ReviewBot





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83310
---


Thanks Jay, the changes look reasonable to me; I'll wait for Jie to comment.

- Timothy Chen





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83308
---

Ship it!


Ship It!

- Timothy Chen





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 11, 2015, 3:06 p.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Description
---

Send statusUpdate to scheduler on containerizer launch failure

This review is actually for three commits.  The three commits are separated out 
on github here:
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

Commit #1:
docker-no-executor-framework example should send ContainerInfo

docker-no-executor-framework stopped working when the protobuf message
changed during development.  Apparently no one noticed because we don't
run it as part of our test suite.  I used it to develop a fix for
proper error reporting, so I fixed it.

Commit #2:
serialize wait and destroy in containerizer

Containerizers keep track of all the containers they have created.  When
a container is destroy()'ed the containerizer removes the container from
its internal hashmap.  This means that wait() must be called before the
container is destroyed.

When the docker containerizer fails because of a "docker pull" or
"docker run" failure, we end up with a race condition where wait() does
not properly return the Termination, so we end up with the default
failure message "Abnormal Executor Termination", which isn't very
useful.

This commit solves this issue by not destroying the container until
after wait is called in the failure case.

Fixes MESOS-2656
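Commit #2's ordering constraint can be sketched in a few lines. This is a
hypothetical, stdlib-only illustration (Container, Containerizer,
launchFailed, waitThenDestroy are invented names, not the actual Mesos
containerizer API); the point is only that destroy() erases the hashmap
entry that wait() needs, so the failure path must collect the termination
before dropping the entry.

```cpp
#include <map>
#include <string>

// Hypothetical stand-in for the containerizer's bookkeeping.
struct Container {
  std::string termination;   // the real error, e.g. from "docker pull"
  bool terminated = false;
};

class Containerizer {
 public:
  // Record a launch failure for a container (sketch only).
  void launchFailed(const std::string& id, const std::string& error) {
    containers_[id] = Container{error, true};
  }

  // wait() can only report the real error while the entry still exists.
  std::string wait(const std::string& id) {
    auto it = containers_.find(id);
    if (it == containers_.end() || !it->second.terminated) {
      return "Abnormal Executor Termination";  // the unhelpful default
    }
    return it->second.termination;
  }

  // destroy() removes the container from the internal hashmap.
  void destroy(const std::string& id) { containers_.erase(id); }

  // The fix in spirit: in the failure path, serialize wait before
  // destroy instead of letting the two race.
  std::string waitThenDestroy(const std::string& id) {
    std::string termination = wait(id);  // collect the real error first
    destroy(id);                         // only then drop the entry
    return termination;
  }

 private:
  std::map<std::string, Container> containers_;
};
```

If destroy() wins the race, wait() finds nothing and callers see only the
generic message; calling wait() first preserves the containerizer's actual
error.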

Commit #3:
send scheduler containerizer launch failure errors

If a method in the launch sequence returned a failure, the error message
wasn't making it back to the scheduler.

Fixes MESOS-2020
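Commit #3's intent can be illustrated with a small sketch (StatusUpdate,
LaunchStep, and runLaunchSequence are hypothetical names, not the real
slave/containerizer interfaces): each step of the launch sequence may fail,
and the first failure message should be forwarded to the scheduler as a
TASK_FAILED status update rather than dropped.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical shape of the update the slave sends the scheduler.
struct StatusUpdate {
  std::string state;    // e.g. "TASK_FAILED"
  std::string message;  // the containerizer's error, verbatim
};

// Each launch step returns "" on success, or an error message.
using LaunchStep = std::function<std::string()>;

// Run each step in order; on the first failure, build the update that
// carries the real error back to the scheduler.
StatusUpdate runLaunchSequence(const std::vector<LaunchStep>& steps) {
  for (const LaunchStep& step : steps) {
    std::string error = step();
    if (!error.empty()) {
      return {"TASK_FAILED", error};  // surface the real error
    }
  }
  return {"TASK_RUNNING", ""};
}
```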


Diffs (updated)
-

  include/mesos/containerizer/containerizer.proto 
95c84dfbecef28ca9a220a98293c5fc6215a692f 
  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
  src/tests/docker_containerizer_tests.cpp 
c9d66b3fbc7d081f36c26781573dca50de823c44 

Diff: https://reviews.apache.org/r/33249/diff/


Testing
---

I used examples/docker_no_executor_framework.cpp, wrote a few tests, and 
ran make check.


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington


> On May 11, 2015, 12:12 p.m., Timothy Chen wrote:
> > src/tests/slave_tests.cpp, line 325
> > 
> >
> > I don't really understand this comment? And if you simply pause the 
> > clock your await won't work right?

I'm starting to question the legitimacy of this test anyway.  What it really 
ends up testing is the TestContainerizer, and that makes no sense.  So I'm just 
going to remove the whole thing.

Also, it currently doesn't work (see ReviewBot's previous post) because 
EXPECT_CALL seems to cause _launch to never get invoked.


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83265
---





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83265
---



src/slave/containerizer/docker.cpp


We don't enqueue destroy anymore, let's fix this comment.



src/tests/slave_tests.cpp


I don't really understand this comment? And if you simply pause the clock 
your await won't work right?


- Timothy Chen





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington


> On May 11, 2015, 10:14 a.m., Timothy Chen wrote:
> > src/slave/containerizer/docker.cpp, line 673
> > 
> >
> > Does this means when docker run fails we won't be able to get the error 
> > message from it?

No, I just moved it up the stack.  Now, instead of capturing only the Future of 
run, we capture the Futures returned by all of the functions in the launch 
sequence.  container->launch will return the Failure when run fails.
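As a toy illustration of the propagation described above (this is a simplified sketch, not the actual libprocess Future code; `Step`, `launch`, and `statusMessage` are invented names):

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>
#include <vector>

// Hypothetical stand-in for one step of the launch sequence (pull,
// create the container, run, ...).  A step either succeeds (nullopt)
// or fails with an error message.
using Step = std::function<std::optional<std::string>()>;

// Run the steps in order and surface the first failure to the caller
// instead of swallowing it inside the containerizer.
std::optional<std::string> launch(const std::vector<Step>& steps)
{
  for (const Step& step : steps) {
    if (std::optional<std::string> error = step()) {
      return error;  // Propagate the failure up the stack.
    }
  }
  return std::nullopt;
}

// The slave can then turn a propagated failure into the status update
// message the scheduler sees, rather than a generic
// "Abnormal Executor Termination".
std::string statusMessage(const std::optional<std::string>& result)
{
  return result ? "Failed to launch container: " + *result
                : "TASK_RUNNING";
}
```

With this shape, a failure in any step (not just run) reaches the status update.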


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83233
---





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83233
---



include/mesos/containerizer/containerizer.proto


Hmm, was going to say it should end with a period, but since it's a type 
perhaps this is fine.



src/slave/containerizer/docker.cpp


Does this mean that when docker run fails we won't be able to get the error 
message from it?


- Timothy Chen





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 11, 2015, 10:10 a.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Changes
---

Added back the slave_test and fixed tnachen's two comments.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Diffs (updated)
-

  include/mesos/containerizer/containerizer.proto 
95c84dfbecef28ca9a220a98293c5fc6215a692f 
  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
  src/tests/docker_containerizer_tests.cpp 
c9d66b3fbc7d081f36c26781573dca50de823c44 
  src/tests/slave_tests.cpp 04e79ec061e767a82d40242be7884967942c50b6 

Diff: https://reviews.apache.org/r/33249/diff/


Testing
---

I used examples/docker_no_executor_framework.cpp, wrote a few tests, and ran 
make check.


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-11 Thread Jay Buffington


> On May 8, 2015, 11:48 p.m., Timothy Chen wrote:
> > src/slave/containerizer/mesos/containerizer.cpp, line 567
> > 
> >
> > No tests for mesos containerizer change?

The changes to the mesos containerizer were to support the serialization of 
wait/destroy.  If the mesos containerizer fails to, say, fetch, the scheduler 
still gets back a poor "executor terminated" message.  This happens much more 
rarely than in the docker case, so I didn't address it in this patch.

We weren't testing that destroy was getting called when launch failed.  Adding 
a test now that checks that destroy does not get called on failure is odd.  I'm 
up for suggestions, though.

I did re-add my original slave_test.cpp test back.
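The wait-before-destroy ordering being serialized here can be sketched as a toy model (invented types; the real containerizer keys on ContainerID and resolves a promised Termination, not plain strings):

```cpp
#include <cassert>
#include <map>
#include <string>

// Toy model of a containerizer's internal hashmap.  The real code keys
// on ContainerID and stores per-container state ending in a Termination;
// here a plain string message stands in for all of that.
struct Containerizer
{
  std::map<std::string, std::string> containers;

  // wait() yields the container's Termination message, or the generic
  // fallback if the container has already been erased from the map.
  std::string wait(const std::string& id) const
  {
    auto it = containers.find(id);
    return it == containers.end()
        ? std::string("Abnormal Executor Termination")
        : it->second;
  }

  // destroy() removes the container, after which wait() can no longer
  // recover the real failure message -- hence wait must come first.
  void destroy(const std::string& id) { containers.erase(id); }
};
```

Destroying before wait loses the useful error; waiting first preserves it.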


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83136
---





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-08 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83136
---



src/slave/containerizer/mesos/containerizer.cpp


No tests for mesos containerizer change?



src/tests/docker_containerizer_tests.cpp


Do you want to call it just _pull to be consistent here? It's not going to 
clash because we usually prefix the namespace.



src/tests/docker_containerizer_tests.cpp


Remove the extra two spaces.


- Timothy Chen





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-08 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review83130
---


Patch looks great!

Reviews applied: [33249]

All tests passed.

- Mesos ReviewBot





Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-08 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 8, 2015, 7:42 p.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Changes
---

added tests.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Diffs (updated)
-

  include/mesos/containerizer/containerizer.proto 
95c84dfbecef28ca9a220a98293c5fc6215a692f 
  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
  src/tests/docker_containerizer_tests.cpp 
c9d66b3fbc7d081f36c26781573dca50de823c44 

Diff: https://reviews.apache.org/r/33249/diff/


Testing (updated)
---

I used examples/docker_no_executor_framework.cpp, wrote a few tests, and ran 
make check.


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jay Buffington


> On May 7, 2015, 4:53 p.m., Jie Yu wrote:
> > Overall, LGTM!  The only reason I don't give a ship-it here is the tests.  
> > Could you please add an integration test in docker_containerizer_tests.cpp?

I'll add the tests tomorrow and update the diff.


> On May 7, 2015, 4:53 p.m., Jie Yu wrote:
> > src/examples/docker_no_executor_framework.cpp, line 49
> > 
> >
> > Why this change?

I was using this to test/develop, so it was noisy and unnecessary to have 5 
tasks started.  I'll revert.


> On May 7, 2015, 4:53 p.m., Jie Yu wrote:
> > src/examples/docker_no_executor_framework.cpp, lines 145-148
> > 
> >
> > This looks weird to me. Are you testing the failure case here? If 
> > that's the case, could you please instead write a gtest-based test (not in 
> > the example framework)?

Sure, I'll write a test.


> On May 7, 2015, 4:53 p.m., Jie Yu wrote:
> > src/slave/containerizer/mesos/containerizer.cpp, line 586
> > 
> >
> > This changes the semantics a little bit.  Previously, `killed` in the 
> > Termination would be set to false (the framework would thus receive 
> > TASK_LOST).  Now `killed` will be set to true (the framework will receive 
> > TASK_FAILED).
> > 
> > However, I think that's fine!  The semantics around `killed` in 
> > Termination are definitely weird to me.  With a REASON field in the 
> > Termination, we could probably get rid of `killed`.  Could you please add 
> > a TODO in the Termination protobuf?

Grr... `killed` is required.  But since this is an internal protobuf, does that 
matter?  Maybe we can just break backwards compatibility?

Adding Reason (or rather setting it) was much less straightforward than I 
expected it to be.  I talked to Jörg about this and he said he was willing to 
take it on next week.  I'll note it here and reference MESOS-2035.
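For context, the `killed` semantics under discussion can be sketched as a toy mapping (the `Termination` struct and `taskState` helper below are invented stand-ins for illustration, not the actual Mesos protobuf or code):

```cpp
#include <cassert>
#include <string>

// Invented stand-in for the internal Termination message; the real
// protobuf has more fields, and `killed` is (as noted above) required.
struct Termination
{
  bool killed;
  std::string message;
};

// Hypothetical mapping from a Termination to the task state the
// framework receives: with the patch, a containerizer launch failure
// surfaces as TASK_FAILED (killed == true) instead of TASK_LOST.
std::string taskState(const Termination& termination)
{
  return termination.killed ? "TASK_FAILED" : "TASK_LOST";
}
```

A REASON field on Termination would let this mapping replace the `killed` flag entirely.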


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82940
---



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jie Yu

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82940
---


Overall, LGTM!  The only reason I don't give a ship-it here is the tests.  
Could you please add an integration test in docker_containerizer_tests.cpp?


src/examples/docker_no_executor_framework.cpp


Why this change?



src/examples/docker_no_executor_framework.cpp


This looks weird to me. Are you testing the failure case here? If that's 
the case, could you please instead write a gtest-based test (not in the example 
framework)?



src/slave/containerizer/containerizer.hpp


s/you/You/



src/slave/containerizer/mesos/containerizer.cpp


This changes the semantics a little bit.  Previously, `killed` in the 
Termination would be set to false (the framework would thus receive TASK_LOST).  
Now `killed` will be set to true (the framework will receive TASK_FAILED).

However, I think that's fine!  The semantics around `killed` in Termination 
are definitely weird to me.  With a REASON field in the Termination, we could 
probably get rid of `killed`.  Could you please add a TODO in the Termination 
protobuf?


- Jie Yu


On May 7, 2015, 6:57 p.m., Jay Buffington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> ---
> 
> (Updated May 7, 2015, 6:57 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020 and MESOS-2656
> https://issues.apache.org/jira/browse/MESOS-2020
> https://issues.apache.org/jira/browse/MESOS-2656
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Send statusUpdate to scheduler on containerizer launch failure
> 
> This review is actually for three commits.  The three commits are separated 
> out
> on github here:
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> Commit #1:
> docker-no-executor-framework example should send ContainerInfo
> 
> docker-no-executor-framework stopped working when the protobuf message
> changed during development.  Apparently noone noticed because we don't
> run it as part of our test suite.  I used it to develop a fix for
> proper error reporting, so I fixed it.
> 
> Commit #2:
> serialize wait and destroy in containerizer
> 
> Containerizers keep track of all the containers they have created.  When
> a container is destroy()'ed the containerizer removes the container from
> its internal hashmap.  This means that wait() must be called before the
> container is destroyed.
> 
> When the docker containerizer fails because of a "docker pull" or
> "docker run" failure, we end up with a race condition where wait() does
> not properly return the Termination, so we ended up with a default
> failure message of "Abnormal Executor Termination" which isn't very
> useful.
> 
> This commit solves this issue by not destroying the container until
> after wait is called in the failure case.
> 
> Fixes MESOS-2656
> 
> Commit #3:
> send scheduler containerizer launch failure errors
> 
> If a method in the launch sequence returns a failure the error message
> wasn't making it back to the scheduler.
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -
> 
>   src/examples/docker_no_executor_framework.cpp 
> d5385d9295d30a6039bab9938f3270006582d3da 
>   src/slave/containerizer/containerizer.hpp 
> 56c088abb036aa26c323c3142fd27ea29ed51d4f 
>   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
>   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
>   src/slave/containerizer/mesos/containerizer.hpp 
> 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f2587280dc0e1d566d2b856a80358c7b3896c603 
>   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> ---
> 
> I used examples/docker_no_executor_framework.cpp and make check
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82883
---


Patch looks great!

Reviews applied: [33249]

All tests passed.

- Mesos ReviewBot


On May 7, 2015, 6:57 p.m., Jay Buffington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> ---
> 
> (Updated May 7, 2015, 6:57 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020 and MESOS-2656
> https://issues.apache.org/jira/browse/MESOS-2020
> https://issues.apache.org/jira/browse/MESOS-2656
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Send statusUpdate to scheduler on containerizer launch failure
> 
> This review is actually for three commits.  The three commits are separated 
> out
> on github here:
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> Commit #1:
> docker-no-executor-framework example should send ContainerInfo
> 
> docker-no-executor-framework stopped working when the protobuf message
> changed during development.  Apparently no one noticed because we don't
> run it as part of our test suite.  I used it to develop a fix for
> proper error reporting, so I fixed it.
> 
> Commit #2:
> serialize wait and destroy in containerizer
> 
> Containerizers keep track of all the containers they have created.  When
> a container is destroy()'ed the containerizer removes the container from
> its internal hashmap.  This means that wait() must be called before the
> container is destroyed.
> 
> When the docker containerizer fails because of a "docker pull" or
> "docker run" failure, we end up with a race condition where wait() does
> not properly return the Termination, so we ended up with a default
> failure message of "Abnormal Executor Termination" which isn't very
> useful.
> 
> This commit solves this issue by not destroying the container until
> after wait is called in the failure case.
> 
> Fixes MESOS-2656
> 
> Commit #3:
> send scheduler containerizer launch failure errors
> 
> If a method in the launch sequence returns a failure the error message
> wasn't making it back to the scheduler.
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -
> 
>   src/examples/docker_no_executor_framework.cpp 
> d5385d9295d30a6039bab9938f3270006582d3da 
>   src/slave/containerizer/containerizer.hpp 
> 56c088abb036aa26c323c3142fd27ea29ed51d4f 
>   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
>   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
>   src/slave/containerizer/mesos/containerizer.hpp 
> 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f2587280dc0e1d566d2b856a80358c7b3896c603 
>   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> ---
> 
> I used examples/docker_no_executor_framework.cpp and make check
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 7, 2015, 11:57 a.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Description
---

Send statusUpdate to scheduler on containerizer launch failure

This review is actually for three commits.  The three commits are separated out
on GitHub here:
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

Commit #1:
docker-no-executor-framework example should send ContainerInfo

docker-no-executor-framework stopped working when the protobuf message
changed during development.  Apparently no one noticed because we don't
run it as part of our test suite.  Since I used it to develop a fix for
proper error reporting, I fixed it as well.
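The shape of that fix is to attach a ContainerInfo of type DOCKER to the task. A toy sketch, with plain structs standing in for the protobuf-generated classes (illustrative only, not the actual generated API):

```cpp
#include <cassert>
#include <string>

// Plain structs mimicking the setter-style accessors of the
// protobuf-generated ContainerInfo / DockerInfo / TaskInfo classes.
struct DockerInfo {
  std::string image;
  void set_image(const std::string& i) { image = i; }
};

struct ContainerInfo {
  enum Type { DOCKER, MESOS };
  Type type = MESOS;
  DockerInfo docker;
  void set_type(Type t) { type = t; }
  DockerInfo* mutable_docker() { return &docker; }
};

struct TaskInfo {
  ContainerInfo container;
  ContainerInfo* mutable_container() { return &container; }
};

// Without a ContainerInfo of type DOCKER on the task, the slave no longer
// knows to launch it with the docker containerizer -- which is why the
// example framework broke when the protobuf message changed.
void attachDockerContainer(TaskInfo* task, const std::string& image) {
  ContainerInfo* container = task->mutable_container();
  container->set_type(ContainerInfo::DOCKER);
  container->mutable_docker()->set_image(image);
}
```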

Commit #2:
serialize wait and destroy in containerizer

Containerizers keep track of all the containers they have created.  When
a container is destroy()'ed the containerizer removes the container from
its internal hashmap.  This means that wait() must be called before the
container is destroyed.

When the docker containerizer fails because of a "docker pull" or
"docker run" failure, we end up with a race condition where wait() does
not properly return the Termination, so we end up with the default
failure message of "Abnormal Executor Termination", which isn't very
useful.

This commit fixes the issue by not destroying the container until
after wait() has been called in the failure case.

Fixes MESOS-2656
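The ordering constraint described above can be illustrated with a toy model (plain C++ standing in for the real containerizer; every name here is illustrative, not the actual Mesos code):

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Toy model of the bug: the containerizer drops a container from its
// internal hashmap on destroy(), so a wait() issued afterwards can no
// longer recover the real termination message and callers fall back to
// a generic default.
struct ToyContainerizer {
  std::map<std::string, std::string> terminations;  // containerId -> error

  void launchFailed(const std::string& id, const std::string& error) {
    terminations[id] = error;
  }

  std::optional<std::string> wait(const std::string& id) const {
    auto it = terminations.find(id);
    if (it == terminations.end()) {
      return std::nullopt;  // lost: caller substitutes a default message
    }
    return it->second;
  }

  void destroy(const std::string& id) { terminations.erase(id); }
};

// Old ordering: destroy() races ahead of wait(), losing the real error.
std::string destroyThenWait(ToyContainerizer c) {
  c.destroy("c1");
  return c.wait("c1").value_or("Abnormal Executor Termination");
}

// Fixed ordering: wait() observes the Termination before destroy() runs.
std::string waitThenDestroy(ToyContainerizer c) {
  std::string message = c.wait("c1").value_or("Abnormal Executor Termination");
  c.destroy("c1");
  return message;
}
```

The real fix serializes the asynchronous wait and destroy futures rather than reordering synchronous calls, but the failure mode is the same.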

Commit #3:
send scheduler containerizer launch failure errors

If a method in the launch sequence returns a failure, the error message
doesn't make it back to the scheduler.

Fixes MESOS-2020
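The essence of this commit can be sketched as follows, assuming a hypothetical sendStatusUpdate hook (the real Mesos code path goes through the slave's status update machinery and differs in detail):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-ins for the slave's status update path; these names
// are illustrative, not the actual Mesos APIs.
struct StatusUpdate {
  std::string state;    // e.g. "TASK_FAILED"
  std::string message;  // human-readable reason
};

std::vector<StatusUpdate> sentUpdates;

void sendStatusUpdate(const StatusUpdate& update) {
  sentUpdates.push_back(update);
}

// Called when the containerizer's launch future fails. Before the patch
// the failure was only logged, so the scheduler saw a generic "Abnormal
// Executor Termination"; after it, the real error message is forwarded.
void onLaunchFailed(const std::string& failure) {
  sendStatusUpdate(
      {"TASK_FAILED", "Failed to launch container: " + failure});
}
```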


Diffs (updated)
-

  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 

Diff: https://reviews.apache.org/r/33249/diff/


Testing
---

I used examples/docker_no_executor_framework.cpp and make check


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jay Buffington


> On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
> > src/examples/docker_no_executor_framework.cpp, line 109
> > 
> >
> > is this something you like to commit?

Whoops!  That snuck in.  I'll fix it.


> On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
> > src/slave/containerizer/docker.cpp, line 1187
> > 
> >
> > This no longer seems to respect the explicit state we track: which 
> > stage the containerizer was in when we destroy it, and at which stage it 
> > failed.
> > 
> > Does the error message clearly say whether it failed to 
> > fetch/pull/run?

Tracking the stage the container is in when we destroy it is done by the 
container->state enum now, right?  

Yes, the status message on TASK_FAILED that the scheduler gets back now looks 
like this:

Failed to launch container: Failed to 'docker pull bogus-image:latest': 
exit status = exited with status 1 stderr = time="2015-05-07T17:45:35Z" 
level=fatal msg="Error: image library/bogus-image:latest not found"

I tested fetch failures by providing a bogus URI, and run failures by starting 
the slave with --docker=/tmp/docker, which looked like this:

$ cat /tmp/docker
#!/bin/bash

if [[ $1 == "run" ]]; then
  echo "docker run failure!" 1>&2
  exit 123
fi

# "$@" (not unquoted $*) preserves arguments that contain spaces.
exec docker "$@"


> On May 7, 2015, 10:02 a.m., Timothy Chen wrote:
> > src/slave/slave.cpp, line 3125
> > 
> >
> > We shouldn't need to destroy if none of the containerizers return true.

Ah, that's right.  I will just remove this line.


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82832
---


On May 7, 2015, 9:17 a.m., Jay Buffington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> ---
> 
> (Updated May 7, 2015, 9:17 a.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020 and MESOS-2656
> https://issues.apache.org/jira/browse/MESOS-2020
> https://issues.apache.org/jira/browse/MESOS-2656
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Send statusUpdate to scheduler on containerizer launch failure
> 
> This review is actually for three commits.  The three commits are separated 
> out
> on github here:
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> Commit #1:
> docker-no-executor-framework example should send ContainerInfo
> 
> docker-no-executor-framework stopped working when the protobuf message
> changed during development.  Apparently no one noticed because we don't
> run it as part of our test suite.  I used it to develop a fix for
> proper error reporting, so I fixed it.
> 
> Commit #2:
> serialize wait and destroy in containerizer
> 
> Containerizers keep track of all the containers they have created.  When
> a container is destroy()'ed the containerizer removes the container from
> its internal hashmap.  This means that wait() must be called before the
> container is destroyed.
> 
> When the docker containerizer fails because of a "docker pull" or
> "docker run" failure, we end up with a race condition where wait() does
> not properly return the Termination, so we ended up with a default
> failure message of "Abnormal Executor Termination" which isn't very
> useful.
> 
> This commit solves this issue by not destroying the container until
> after wait is called in the failure case.
> 
> Fixes MESOS-2656
> 
> Commit #3:
> send scheduler containerizer launch failure errors
> 
> If a method in the launch sequence returns a failure the error message
> wasn't making it back to the scheduler.
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -
> 
>   src/examples/docker_no_executor_framework.cpp 
> d5385d9295d30a6039bab9938f3270006582d3da 
>   src/slave/containerizer/containerizer.hpp 
> 56c088abb036aa26c323c3142fd27ea29ed51d4f 
>   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
>   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
>   src/slave/containerizer/mesos/containerizer.hpp 
> 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f2587280dc0e1d566d2b856a80358c7b3896c603 
>   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> ---
> 
> I used examples/docker_no_executor_framework.cpp and make check
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Timothy Chen

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82832
---


Do we have any tests that check the changes in the Mesos containerizer and the 
docker containerizer?


src/examples/docker_no_executor_framework.cpp


is this something you like to commit?



src/slave/containerizer/docker.cpp


This no longer seems to respect the explicit state we track: which 
stage the containerizer was in when we destroy it, and at which stage it failed.

Does the error message clearly say whether it failed to fetch/pull/run?



src/slave/slave.cpp


We shouldn't need to destroy if none of the containerizers return true.


- Timothy Chen


On May 7, 2015, 4:17 p.m., Jay Buffington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> ---
> 
> (Updated May 7, 2015, 4:17 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020 and MESOS-2656
> https://issues.apache.org/jira/browse/MESOS-2020
> https://issues.apache.org/jira/browse/MESOS-2656
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Send statusUpdate to scheduler on containerizer launch failure
> 
> This review is actually for three commits.  The three commits are separated 
> out
> on github here:
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> Commit #1:
> docker-no-executor-framework example should send ContainerInfo
> 
> docker-no-executor-framework stopped working when the protobuf message
> changed during development.  Apparently no one noticed because we don't
> run it as part of our test suite.  I used it to develop a fix for
> proper error reporting, so I fixed it.
> 
> Commit #2:
> serialize wait and destroy in containerizer
> 
> Containerizers keep track of all the containers they have created.  When
> a container is destroy()'ed the containerizer removes the container from
> its internal hashmap.  This means that wait() must be called before the
> container is destroyed.
> 
> When the docker containerizer fails because of a "docker pull" or
> "docker run" failure, we end up with a race condition where wait() does
> not properly return the Termination, so we ended up with a default
> failure message of "Abnormal Executor Termination" which isn't very
> useful.
> 
> This commit solves this issue by not destroying the container until
> after wait is called in the failure case.
> 
> Fixes MESOS-2656
> 
> Commit #3:
> send scheduler containerizer launch failure errors
> 
> If a method in the launch sequence returns a failure the error message
> wasn't making it back to the scheduler.
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -
> 
>   src/examples/docker_no_executor_framework.cpp 
> d5385d9295d30a6039bab9938f3270006582d3da 
>   src/slave/containerizer/containerizer.hpp 
> 56c088abb036aa26c323c3142fd27ea29ed51d4f 
>   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
>   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
>   src/slave/containerizer/mesos/containerizer.hpp 
> 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f2587280dc0e1d566d2b856a80358c7b3896c603 
>   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> ---
> 
> I used examples/docker_no_executor_framework.cpp and make check
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Mesos ReviewBot

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review82833
---


Patch looks great!

Reviews applied: [33249]

All tests passed.

- Mesos ReviewBot


On May 7, 2015, 4:17 p.m., Jay Buffington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> ---
> 
> (Updated May 7, 2015, 4:17 p.m.)
> 
> 
> Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.
> 
> 
> Bugs: MESOS-2020 and MESOS-2656
> https://issues.apache.org/jira/browse/MESOS-2020
> https://issues.apache.org/jira/browse/MESOS-2656
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> Send statusUpdate to scheduler on containerizer launch failure
> 
> This review is actually for three commits.  The three commits are separated 
> out
> on github here:
> https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1
> 
> Commit #1:
> docker-no-executor-framework example should send ContainerInfo
> 
> docker-no-executor-framework stopped working when the protobuf message
> changed during development.  Apparently no one noticed because we don't
> run it as part of our test suite.  I used it to develop a fix for
> proper error reporting, so I fixed it.
> 
> Commit #2:
> serialize wait and destroy in containerizer
> 
> Containerizers keep track of all the containers they have created.  When
> a container is destroy()'ed the containerizer removes the container from
> its internal hashmap.  This means that wait() must be called before the
> container is destroyed.
> 
> When the docker containerizer fails because of a "docker pull" or
> "docker run" failure, we end up with a race condition where wait() does
> not properly return the Termination, so we ended up with a default
> failure message of "Abnormal Executor Termination" which isn't very
> useful.
> 
> This commit solves this issue by not destroying the container until
> after wait is called in the failure case.
> 
> Fixes MESOS-2656
> 
> Commit #3:
> send scheduler containerizer launch failure errors
> 
> If a method in the launch sequence returns a failure the error message
> wasn't making it back to the scheduler.
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -
> 
>   src/examples/docker_no_executor_framework.cpp 
> d5385d9295d30a6039bab9938f3270006582d3da 
>   src/slave/containerizer/containerizer.hpp 
> 56c088abb036aa26c323c3142fd27ea29ed51d4f 
>   src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
>   src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
>   src/slave/containerizer/mesos/containerizer.hpp 
> 5e5f13ed8a71ff9510b40b6032d00fd16d312622 
>   src/slave/containerizer/mesos/containerizer.cpp 
> f2587280dc0e1d566d2b856a80358c7b3896c603 
>   src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> ---
> 
> I used examples/docker_no_executor_framework.cpp and make check
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-07 Thread Jay Buffington

---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/
---

(Updated May 7, 2015, 9:17 a.m.)


Review request for mesos, Benjamin Hindman, Jie Yu, and Timothy Chen.


Bugs: MESOS-2020 and MESOS-2656
https://issues.apache.org/jira/browse/MESOS-2020
https://issues.apache.org/jira/browse/MESOS-2656


Repository: mesos


Description (updated)
---

Send statusUpdate to scheduler on containerizer launch failure

This review is actually for three commits.  The three commits are separated out
on GitHub here:
https://github.com/apache/mesos/compare/master...jaybuff:report-launch-failure?expand=1

Commit #1:
docker-no-executor-framework example should send ContainerInfo

docker-no-executor-framework stopped working when the protobuf message
changed during development.  Apparently no one noticed because we don't
run it as part of our test suite.  Since I used it to develop a fix for
proper error reporting, I fixed it as well.

Commit #2:
serialize wait and destroy in containerizer

Containerizers keep track of all the containers they have created.  When
a container is destroy()'ed the containerizer removes the container from
its internal hashmap.  This means that wait() must be called before the
container is destroyed.

When the docker containerizer fails because of a "docker pull" or
"docker run" failure, we end up with a race condition where wait() does
not properly return the Termination, so we end up with the default
failure message of "Abnormal Executor Termination", which isn't very
useful.

This commit fixes the issue by not destroying the container until
after wait() has been called in the failure case.

Fixes MESOS-2656

Commit #3:
send scheduler containerizer launch failure errors

If a method in the launch sequence returns a failure, the error message
doesn't make it back to the scheduler.

Fixes MESOS-2020


Diffs (updated)
-

  src/examples/docker_no_executor_framework.cpp 
d5385d9295d30a6039bab9938f3270006582d3da 
  src/slave/containerizer/containerizer.hpp 
56c088abb036aa26c323c3142fd27ea29ed51d4f 
  src/slave/containerizer/docker.hpp b25ec55bf3cd30d6e8a804d09d90c632a7d12e3f 
  src/slave/containerizer/docker.cpp f9fc89ad7e3c853c3f9f6dcf9aa68e54dc1888c6 
  src/slave/containerizer/mesos/containerizer.hpp 
5e5f13ed8a71ff9510b40b6032d00fd16d312622 
  src/slave/containerizer/mesos/containerizer.cpp 
f2587280dc0e1d566d2b856a80358c7b3896c603 
  src/slave/slave.cpp c78ee3c9e7fc38ad364e83f4abe267e86bfbbc13 

Diff: https://reviews.apache.org/r/33249/diff/


Testing (updated)
---

I used examples/docker_no_executor_framework.cpp and make check


Thanks,

Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-06 Thread Jay Buffington


> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > 
> >
> > Instead of doing that in your way, can we just try to make sure 
> > `containerizer->wait` here will return a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails. In that way, the 
> > `executorTerminated` will properly send status updates to the slave 
> > (TASK_LOST/TASK_FAILED).
> > 
> > Or am I missing something?
> 
> Jie Yu wrote:
> OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am refering to is the fact that we don't send status 
> update to the scheduler if containerizer launch fails until executor 
> reregistration timeout happens. Since for docker containerizer, someone might 
> use a very large timeout value, ideally, the slave should send a status 
> update to the scheduler right after containerizer launch fails.
> 
> After chat with Jay, the problem you guys are refering to is the fact 
> that the scheduler cannot disinguish between the case where the task has 
> failed vs. the case where the configuration of a task is not correct, because 
> in both cases, the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
> To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. In that way, it's going to trigger containerizer->wait and thus send 
> status update to the scheduler.
> 
> Jie Yu wrote:
> Regarding the second problem, IMO, we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate to propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
> Reason field sounds good; I think what you proposed makes sense. In the 
> docker containerizer, at least, we also need to make sure the termination message 
> is set correctly, as it currently doesn't contain all the error information 
> that we pass back to the launch future.
> 
> Jay Buffington wrote:
> Sorry for the confusion.   There are three problems that are all related. 
>  Yes, we need to send statusUpdates as soon as containerizer launch fails and 
> yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request.  However, my intention with this patch is not to 
> address either of those two problems.
> 
> My goal with this patch is to simply send the containerizer launch 
> failure message back to the scheduler.  I am using Aurora with the docker 
> containerizer.  There are a myriad of reasons that dockerd could fail to 
> "docker run" a container.  Currently, when that fails the user sees a useless 
> and incorrect "Abnormal Executor Termination" message in the Aurora web ui.  
> With this patch they see the stderr output of "docker run."  This way they 
> can report meaningful error messages to the operations team.
> 
> I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a statusUpdate with 
> a message that contains future.failure().  Before this patch we were only 
> logging it.  The scheduler needs to get that error message.
> 
> Jie Yu wrote:
> Thanks for clarifying it, Jay! In fact, my proposal above should be able 
> to solve the third problem cleanly. Check out 
> `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should 
> properly set the message and reason fields in the Termination protobuf 
> (basically why the container gets terminated and what's the error message). 
> The slave will forward the reason and message to the scheduler through status 
> update.
> 
> I chatted with BenM about this yesterday, and there are a couple of notes 
> I want to write down here.
> 
> 1. We probably need multiple levels for TaskStatus::Reason. In other 
> words, we probably need a "repeated Reason reasons" field in status update 
> message. For instance, for a containerizer launch failure, we probably need 
> two reasons: 1) the top level reason REASON_EXECUTOR_TERMINATED; 2) the 
> second level reason REASON_DOCKER_PULL_FAILURE;
> 
> 2. We probably want to allow extension to TaskStatus::Reason enum. For 
> example, some framework/executor may want to add customized reasons. We could 
> leverage the protobuf extension support to achieve that 
> (https://developers.google.com/protocol-buffers/docs/proto#extensions).
> 
> 3. The semantics around Termination are currently broken and we need to 
> clean them up. For instance, the following code is broken:
> ```
> void Slave::sendExecutorTerminatedStatusUpdate(...)
> {
>   ...
>   if (termination.isReady() && termination.get().killed()) {
>     taskState = TASK_FAILED;
>     // TODO(dhamon): 

Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-06 Thread Timothy Chen


> On April 21, 2015, 11:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > 
> >
> > Instead of doing that in your way, can we just try to make sure 
> > `containerizer->wait` here will return a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails. In that way, the 
> > `executorTerminated` will properly send status updates to the slave 
> > (TASK_LOST/TASK_FAILED).
> > 
> > Or am I missing something?
> 
> Jie Yu wrote:
> OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler if the containerizer launch fails until the executor 
> reregistration timeout happens. Since for the docker containerizer someone might 
> use a very large timeout value, ideally the slave should send a status 
> update to the scheduler right after the containerizer launch fails.
> 
> After chatting with Jay, the problem you guys are referring to is the fact 
> that the scheduler cannot distinguish between the case where the task has 
> failed vs. the case where the configuration of a task is incorrect, because 
> in both cases the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
> To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. In that way, it's going to trigger containerizer->wait and thus send 
> status update to the scheduler.
> 
> Jie Yu wrote:
> Regarding the second problem, IMO, we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate to propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
> Reason field sounds good; I think what you proposed makes sense. In the 
> docker containerizer, at least, we also need to make sure the termination message 
> is set correctly, as it currently doesn't contain all the error information 
> that we pass back to the launch future.
> 
> Jay Buffington wrote:
> Sorry for the confusion.   There are three problems that are all related. 
>  Yes, we need to send statusUpdates as soon as containerizer launch fails and 
> yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request.  However, my intention with this patch is not to 
> address either of those two problems.
> 
> My goal with this patch is to simply send the containerizer launch 
> failure message back to the scheduler.  I am using Aurora with the docker 
> containerizer.  There are a myriad of reasons that dockerd could fail to 
> "docker run" a container.  Currently, when that fails the user sees a useless 
> and incorrect "Abnormal Executor Termination" message in the Aurora web ui.  
> With this patch they see the stderr output of "docker run."  This way they 
> can report meaningful error messages to the operations team.
> 
> I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a statusUpdate with 
> a message that contains future.failure().  Before this patch we were only 
> logging it.  The scheduler needs to get that error message.
> 
> Jie Yu wrote:
> Thanks for clarifying it, Jay! In fact, my proposal above should be able 
> to solve the third problem cleanly. Check out 
> `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should 
> properly set the message and reason fields in the Termination protobuf 
> (basically why the container gets terminated and what's the error message). 
> The slave will forward the reason and message to the scheduler through status 
> update.
> 
> I chatted with BenM about this yesterday, and there are a couple of notes 
> I want to write down here.
> 
> 1. We probably need multiple levels for TaskStatus::Reason. In other 
> words, we probably need a "repeated Reason reasons" field in status update 
> message. For instance, for a containerizer launch failure, we probably need 
> two reasons: 1) the top level reason REASON_EXECUTOR_TERMINATED; 2) the 
> second level reason REASON_DOCKER_PULL_FAILURE;
> 
> 2. We probably want to allow extension to TaskStatus::Reason enum. For 
> example, some framework/executor may want to add customized reasons. We could 
> leverage the protobuf extension support to achieve that 
> (https://developers.google.com/protocol-buffers/docs/proto#extensions).
> 
> 3. The semantics around Termination are currently broken and we need to 
> clean them up. For instance, the following code is broken:
> ```
> void Slave::sendExecutorTerminatedStatusUpdate(...)
> {
>   ...
>   if (termination.isReady() && termination.get().killed()) {
>     taskState = TASK_FAILED;
>     // TODO(dhamon):

Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-05 Thread Jay Buffington


> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > 
> >
> > Instead of doing that in your way, can we just try to make sure 
> > `containerizer->wait` here will return a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails. In that way, the 
> > `executorTerminated` will properly send status updates to the slave 
> > (TASK_LOST/TASK_FAILED).
> > 
> > Or am I missing something?
> 
> Jie Yu wrote:
> OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler if the containerizer launch fails until the executor 
> reregistration timeout happens. Since for the docker containerizer someone might 
> use a very large timeout value, ideally the slave should send a status 
> update to the scheduler right after the containerizer launch fails.
> 
> After chatting with Jay, the problem you guys are referring to is the fact 
> that the scheduler cannot distinguish between the case where the task has 
> failed vs. the case where the configuration of a task is incorrect, because 
> in both cases the scheduler will receive a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
> To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. In that way, it's going to trigger containerizer->wait and thus send 
> status update to the scheduler.
> 
> Jie Yu wrote:
> Regarding the second problem, IMO, we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate to propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
> A reason field sounds good; I think what you proposed makes sense. In the 
> docker containerizer, at least, we also need to make sure the termination 
> message is set correctly, as it currently doesn't contain all of the error 
> information that we pass back to the launch future.
> 
> Jay Buffington wrote:
> Sorry for the confusion. There are three problems that are all related. 
> Yes, we need to send status updates as soon as a containerizer launch fails, 
> and yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request. However, my intention with this patch is not to 
> address either of those two problems.
> 
> My goal with this patch is simply to send the containerizer launch 
> failure message back to the scheduler. I am using Aurora with the docker 
> containerizer. There are a myriad of reasons that dockerd could fail to 
> "docker run" a container. Currently, when that fails, the user sees a useless 
> and incorrect "Abnormal Executor Termination" message in the Aurora web UI. 
> With this patch they see the stderr output of "docker run," so they can 
> report meaningful error messages to the operations team.
> 
> I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a statusUpdate with 
> a message that contains future.failure(). Before this patch we were only 
> logging it. The scheduler needs to get that error message.
> 
> Jie Yu wrote:
> Thanks for clarifying it, Jay! In fact, my proposal above should be able 
> to solve the third problem cleanly. Check out 
> `Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should 
> properly set the message and reason fields in the Termination protobuf 
> (basically why the container gets terminated and what's the error message). 
> The slave will forward the reason and message to the scheduler through status 
> update.
> 
> I chatted with BenM about this yesterday, and there are a couple of notes 
> I want to write down here.
> 
> 1. We probably need multiple levels for TaskStatus::Reason. In other 
> words, we probably need a "repeated Reason reasons" field in status update 
> message. For instance, for a containerizer launch failure, we probably need 
> two reasons: 1) the top level reason REASON_EXECUTOR_TERMINATED; 2) the 
> second level reason REASON_DOCKER_PULL_FAILURE;
> 
> 2. We probably want to allow extension to TaskStatus::Reason enum. For 
> example, some framework/executor may want to add customized reasons. We could 
> leverage the protobuf extension support to achieve that 
> (https://developers.google.com/protocol-buffers/docs/proto#extensions).
> 
> 3. The semantics around Termination is broken currently and we need to 
> clean it up. For instance, the following code is broken,
> ```
> void Slave::sendExecutorTerminatedStatusUpdate(...)
> {
>   ...
>   if (termination.isReady() && termination.get().killed()) {
>     taskState = TASK_FAILED;
>     // TODO(dhamon):
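[Editorial sketch] Ideas 1 and 2 above could look roughly like the following in proto2. This is only an illustrative sketch: the field names and numbers are hypothetical and do not match the actual mesos.proto definitions, and since proto2 extensions apply to messages rather than enums, a framework-defined reason would ride in an extension field rather than as a new enum value.

```protobuf
// Hypothetical sketch only; names and field numbers are illustrative
// and do not match the real Mesos definitions.
message TaskStatus {
  enum Reason {
    REASON_EXECUTOR_TERMINATED = 1;
    REASON_DOCKER_PULL_FAILURE = 2;
  }

  // Idea 1: multiple levels of reasons, most general first, e.g.
  // [REASON_EXECUTOR_TERMINATED, REASON_DOCKER_PULL_FAILURE].
  repeated Reason reasons = 10;

  // Idea 2: reserve an extension range so frameworks/executors can
  // attach customized reason information.
  extensions 100 to 199;
}

// A framework could then define its own extension field:
extend TaskStatus {
  optional string framework_reason = 100;
}
```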

Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-05 Thread Jie Yu



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-04 Thread Timothy Chen



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-04 Thread Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-05-01 Thread Jie Yu



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-30 Thread Jay Buffington



Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-22 Thread Jie Yu




Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-22 Thread Jie Yu


> On April 21, 2015, 11:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > 
> >
> > Instead of doing that in your way, can we just try to make sure 
> > `containerizer->wait` here will return a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails. In that way, the 
> > `executorTerminated` will properly send status updates to the slave 
> > (TASK_LOST/TASK_FAILED).
> > 
> > Or am I missing something?
> 
> Jie Yu wrote:
> OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler when a containerizer launch fails until the executor 
> reregistration timeout fires. Since someone might use a very large timeout 
> value with the docker containerizer, the slave should ideally send a status 
> update to the scheduler right after the containerizer launch fails.
> 
> After chatting with Jay, the problem you guys are referring to is the fact 
> that the scheduler cannot distinguish between the case where the task has 
> failed and the case where the configuration of a task is incorrect, because 
> in both cases the scheduler receives a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
> To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. In that way, it's going to trigger containerizer->wait and thus send 
> status update to the scheduler.
> 
> Jie Yu wrote:
> Regarding the second problem, IMO, we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
> The reason field sounds good; I think what you proposed makes sense. In the 
> docker containerizer, at least, we also need to make sure the termination 
> message is set correctly, as it currently doesn't contain all the error 
> information that we pass back to the launch future.
> 
> Jay Buffington wrote:
> Sorry for the confusion. There are three problems that are all related. 
> Yes, we need to send statusUpdates as soon as a containerizer launch fails, 
> and yes, we need to set the reason so the scheduler can distinguish a slave 
> failure from a bad request. However, my intention with this patch is not to 
> address either of those two problems.
> 
> My goal with this patch is simply to send the containerizer launch 
> failure message back to the scheduler. I am using Aurora with the docker 
> containerizer. There are a myriad of reasons that dockerd could fail to 
> "docker run" a container. Currently, when that fails, the user sees a useless 
> and incorrect "Abnormal Executor Termination" message in the Aurora web UI. 
> With this patch they see the stderr output of "docker run." This way they 
> can report meaningful error messages to the operations team.
> 
> I can update this patch to address the other two issues, but the key is 
> that when the containerizer launch fails we have to send a statusUpdate with 
> a message that contains future.failure(). Before this patch we were only 
> logging it. The scheduler needs to get that error message.

Thanks for clarifying it, Jay! In fact, my proposal above should be able to 
solve the third problem cleanly. Check out 
`Slave::sendExecutorTerminatedStatusUpdate`. The containerizer should properly 
set the message and reason fields in the Termination protobuf (basically why 
the container gets terminated and what's the error message). The slave will 
forward the reason and message to the scheduler through status update.

I chatted with BenM about this yesterday, and there are a couple of notes I 
want to write down here.

1. We probably need multiple levels for TaskStatus::Reason. In other words, we 
probably need a "repeated Reason reasons" field in the status update message. 
For instance, for a containerizer launch failure, we probably need two reasons: 
1) the top-level reason REASON_EXECUTOR_TERMINATED; 2) the second-level reason 
REASON_DOCKER_PULL_FAILURE.

2. We probably want to allow extensions to the TaskStatus::Reason enum. For 
example, some framework/executor may want to add customized reasons. We could 
leverage protobuf extension support to achieve that 
(https://developers.google.com/protocol-buffers/docs/proto#extensions).

3. The semantics around Termination are currently broken and we need to clean 
them up. For instance, the following code is broken,
```
void Slave::sendExecutorTerminatedStatusUpdate(...)
{
  ...
  if (termination.isReady() && termination.get().killed()) {
    taskState = TASK_FAILED;
    // TODO(dhamon): MESOS-2035: Add 'reason' to containerizer::Termination.
    reason = TaskStatus::REASON_MEMORY_LIMIT;
  }
}
```
because we now have disk limits as well.
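To make the multi-level reason idea in note 1 concrete, here is a minimal 
self-contained C++ sketch. The Reason values and TaskStatus fields are 
stand-ins modeled on this discussion, not the real Mesos protobuf definitions:

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins for the proposed enums/messages; the names follow
// the discussion above but are not the actual Mesos protobufs.
enum class Reason {
  REASON_EXECUTOR_TERMINATED,   // top-level: why the task transitioned
  REASON_DOCKER_PULL_FAILURE,   // second-level: the underlying cause
};

struct TaskStatus {
  std::string state;            // e.g. "TASK_FAILED"
  std::vector<Reason> reasons;  // the proposed "repeated Reason reasons"
  std::string message;          // human-readable detail, e.g. docker stderr
};

// A containerizer launch failure would carry both levels, most general first,
// plus the error message instead of only logging it.
TaskStatus makeLaunchFailureUpdate(const std::string& error) {
  TaskStatus status;
  status.state = "TASK_FAILED";
  status.reasons = {Reason::REASON_EXECUTOR_TERMINATED,
                    Reason::REASON_DOCKER_PULL_FAILURE};
  status.message = error;
  return status;
}
```

A scheduler could then branch on reasons[0] for coarse handling and inspect 
reasons[1] (or the message) for diagnostics.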

Re: Review Request 33249: Send statusUpdate to scheduler on containerizer launch failure

2015-04-22 Thread Jay Buffington


> On April 21, 2015, 4:25 p.m., Jie Yu wrote:
> > src/slave/slave.cpp, lines 3065-3078
> > 
> >
> > Instead of doing that in your way, can we just try to make sure 
> > `containerizer->wait` here will return a failure (or a Termination with 
> > some reason) when `containerizer->launch` fails. In that way, the 
> > `executorTerminated` will properly send status updates to the slave 
> > (TASK_LOST/TASK_FAILED).
> > 
> > Or am I missing something?
> 
> Jie Yu wrote:
> OK, I think I got confused by the ticket. There are actually two problems 
> here. The problem I am referring to is the fact that we don't send a status 
> update to the scheduler when a containerizer launch fails until the executor 
> reregistration timeout fires. Since someone might use a very large timeout 
> value with the docker containerizer, the slave should ideally send a status 
> update to the scheduler right after the containerizer launch fails.
> 
> After chatting with Jay, the problem you guys are referring to is the fact 
> that the scheduler cannot distinguish between the case where the task has 
> failed and the case where the configuration of a task is incorrect, because 
> in both cases the scheduler receives a TASK_FAILED/TASK_LOST.
> 
> Jie Yu wrote:
> To address the first problem, I think the simplest way is to add a 
> containerizer->destroy(..) in executorLaunched when containerizer->launch 
> fails. In that way, it's going to trigger containerizer->wait and thus send 
> status update to the scheduler.
> 
> Jie Yu wrote:
> Regarding the second problem, IMO, we should include a reason field in 
> Termination (https://issues.apache.org/jira/browse/MESOS-2035) and let 
> sendExecutorTerminatedStatusUpdate propagate the termination reason to the 
> scheduler.
> 
> Timothy Chen wrote:
> The reason field sounds good; I think what you proposed makes sense. In the 
> docker containerizer, at least, we also need to make sure the termination 
> message is set correctly, as it currently doesn't contain all the error 
> information that we pass back to the launch future.

Sorry for the confusion. There are three problems that are all related. Yes, 
we need to send statusUpdates as soon as a containerizer launch fails, and yes, 
we need to set the reason so the scheduler can distinguish a slave failure from 
a bad request. However, my intention with this patch is not to address either 
of those two problems.

My goal with this patch is simply to send the containerizer launch failure 
message back to the scheduler. I am using Aurora with the docker containerizer. 
There are a myriad of reasons that dockerd could fail to "docker run" a 
container. Currently, when that fails, the user sees a useless and incorrect 
"Abnormal Executor Termination" message in the Aurora web UI. With this patch 
they see the stderr output of "docker run." This way they can report meaningful 
error messages to the operations team.

I can update this patch to address the other two issues, but the key is that 
when the containerizer launch fails we have to send a statusUpdate with a 
message that contains future.failure(). Before this patch we were only logging 
it. The scheduler needs to get that error message.
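The gist of the change can be sketched with a self-contained toy model. The 
Future and StatusUpdate types below are illustrative stand-ins, not the real 
Mesos/libprocess types:

```cpp
#include <string>

// Minimal stand-in for a completed launch future.
struct LaunchResult {
  bool failed;
  std::string failure;  // e.g. the stderr of "docker run"
};

struct StatusUpdate {
  std::string state;
  std::string message;
};

// Before the patch the launch failure was only logged; forwarding it in the
// status update lets users see the real "docker run" error instead of a
// generic "Abnormal Executor Termination".
StatusUpdate onExecutorLaunched(const LaunchResult& launch) {
  StatusUpdate update;
  if (launch.failed) {
    update.state = "TASK_FAILED";
    update.message = launch.failure;  // carries future.failure()
  } else {
    update.state = "TASK_RUNNING";
  }
  return update;
}
```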


- Jay


---
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/33249/#review81090
---


On April 21, 2015, 10:14 a.m., Jay Buffington wrote:
> 
> ---
> This is an automatically generated e-mail. To reply, visit:
> https://reviews.apache.org/r/33249/
> ---
> 
> (Updated April 21, 2015, 10:14 a.m.)
> 
> 
> Review request for mesos, Ben Mahler, Timothy Chen, and Vinod Kone.
> 
> 
> Bugs: MESOS-2020
> https://issues.apache.org/jira/browse/MESOS-2020
> 
> 
> Repository: mesos
> 
> 
> Description
> ---
> 
> When Mesos is unable to launch the containerizer, the scheduler should
> get a TASK_FAILED with a status message that includes the error the
> containerizer encountered when trying to launch.
> 
> Introduces a new TaskStatus reason: REASON_CONTAINERIZER_LAUNCH_FAILED
> 
> Fixes MESOS-2020
> 
> 
> Diffs
> -
> 
>   include/mesos/mesos.proto 3a8e8bf303e0576c212951f6028af77e54d93537 
>   src/slave/slave.cpp 8ec80ed26f338690e0a1e712065750ab77a724cd 
>   src/tests/slave_tests.cpp b826000e0a4221690f956ea51f49ad4c99d5e188 
> 
> Diff: https://reviews.apache.org/r/33249/diff/
> 
> 
> Testing
> ---
> 
> I added a test case to slave_tests.cpp. I also tried this with Aurora: I 
> supplied a bogus docker image URL and saw the "docker pull" failure stderr 
> message in Aurora's web UI.
> 
> 
> Thanks,
> 
> Jay Buffington
> 
>