[jira] [Assigned] (MESOS-7428) Report exit code of tasks from default and command executors
[ https://issues.apache.org/jira/browse/MESOS-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Eric Chung reassigned MESOS-7428: - Assignee: Eric Chung (was: Zhitao Li) > Report exit code of tasks from default and command executors > > > Key: MESOS-7428 > URL: https://issues.apache.org/jira/browse/MESOS-7428 > Project: Mesos > Issue Type: Improvement > Components: containerization > Reporter: Zhitao Li > Assignee: Eric Chung > Priority: Major > > Use case: some tasks should only be retried if the exit code matches certain > user requirements. > According to [~gilbert], we already checkpoint the exit code in the containerizer > now, so we need to clarify how to report exit codes for executor containers > vs. nested containers, and we should do this consistently for the command and > default executors. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7428) Report exit code of tasks from default and command executors
[ https://issues.apache.org/jira/browse/MESOS-7428?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446569#comment-16446569 ] Eric Chung commented on MESOS-7428: --- Our users actually expect to be able to view their exit codes in the terminal. Let me see what I can do. > Report exit code of tasks from default and command executors > > > Key: MESOS-7428 > URL: https://issues.apache.org/jira/browse/MESOS-7428 > Project: Mesos > Issue Type: Improvement > Components: containerization > Reporter: Zhitao Li > Assignee: Zhitao Li > Priority: Major > > Use case: some tasks should only be retried if the exit code matches certain > user requirements. > According to [~gilbert], we already checkpoint the exit code in the containerizer > now, so we need to clarify how to report exit codes for executor containers > vs. nested containers, and we should do this consistently for the command and > default executors.
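To illustrate the use case in MESOS-7428, here is a minimal Python sketch of retrying a task only when its exit code matches a user-supplied set. It assumes the exit code is already available to the caller; `run_with_retry`, `retryable_codes` and `max_attempts` are hypothetical names for illustration and are not part of any Mesos API.

```python
def run_with_retry(run, retryable_codes, max_attempts=3):
    """Call run() until it returns a non-retryable exit code.

    run             -- hypothetical callable that launches the task and
                       returns its integer exit code
    retryable_codes -- set of exit codes the user wants retried
    max_attempts    -- upper bound on launches (an assumption of this sketch)

    Returns (last_exit_code, attempts_used).
    """
    attempts = 0
    while True:
        attempts += 1
        code = run()
        # Stop as soon as the code is not retryable or the budget is spent.
        if code not in retryable_codes or attempts >= max_attempts:
            return code, attempts


# Stand-in "task": fails twice with 75, then succeeds with 0.
# 75 is EX_TEMPFAIL in BSD sysexits.h; it is used here only as an example.
_results = iter([75, 75, 0])
final_code, used = run_with_retry(lambda: next(_results), {75}, max_attempts=5)
```

Without a reported exit code the caller cannot tell a retryable failure from a permanent one, which is why the ticket asks for the code to be surfaced consistently by both executors.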
[jira] [Created] (MESOS-8816) Add missing fields in ResourceUsage to agent /monitor/statistics endpoint.
Gilbert Song created MESOS-8816: --- Summary: Add missing fields in ResourceUsage to agent /monitor/statistics endpoint. Key: MESOS-8816 URL: https://issues.apache.org/jira/browse/MESOS-8816 Project: Mesos Issue Type: Improvement Components: agent Reporter: Gilbert Song Add missing fields in ResourceUsage to the agent /monitor/statistics endpoint, e.g., container_id. https://github.com/apache/mesos/blob/1.5.0/src/slave/http.cpp#L2057
[jira] [Created] (MESOS-8815) Authentication failure leads to test failure (KillTaskBetweenRunTaskParts)
Meng Zhu created MESOS-8815: --- Summary: Authentication failure leads to test failure (KillTaskBetweenRunTaskParts) Key: MESOS-8815 URL: https://issues.apache.org/jira/browse/MESOS-8815 Project: Mesos Issue Type: Bug Reporter: Meng Zhu Attachments: KillTaskBetweenRunTaskParts_fail_authentication_failure.txt
{code:java}
I0420 14:35:58.904254 17810 slave.cpp:1319] Detecting new master
I0420 14:35:58.910846 17804 slave.cpp:1346] Authenticating with master master@10.0.49.2:45006
I0420 14:35:58.910960 17804 slave.cpp:1355] Using default CRAM-MD5 authenticatee
I0420 14:35:58.911288 17812 authenticatee.cpp:121] Creating new client SASL connection
W0420 14:36:03.912529 17791 slave.cpp:1457] Authentication timed out
I0420 14:36:03.927395 17784 master.cpp:9213] Authenticating slave(78039)@10.0.49.2:45006
W0420 14:36:03.927515 17812 slave.cpp:1402] Failed to authenticate with master master@10.0.49.2:45006: Authentication discarded
W0420 14:36:03.928010 17819 master.cpp:9240] Failed to authenticate slave(78039)@10.0.49.2:45006: Failed to communicate with authenticatee
I0420 14:36:04.576311 17779 slave.cpp:1346] Authenticating with master master@10.0.49.2:45006
I0420 14:36:04.576406 17779 slave.cpp:1355] Using default CRAM-MD5 authenticatee
I0420 14:36:04.576769 17788 authenticatee.cpp:121] Creating new client SASL connection
../../src/tests/slave_tests.cpp:4111: Failure
Failed to wait 15secs for offers
{code}
Log attached.
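When reading a log like the one in MESOS-8815, it can help to compute the gap between two entries. The following Python sketch parses the glog-style prefix seen above (severity letter, month and day, time, thread id, `file:line]`) and measures the time between the SASL connection being created and the authentication timeout. The helper name `parse` and the default year 2018 are assumptions of this sketch, since the log prefix carries no year.

```python
import re
from datetime import datetime

# Prefix layout as it appears in the log above: Lmmdd hh:mm:ss.uuuuuu tid file:line] msg
GLOG = re.compile(
    r"^(?P<level>[IWEF])(?P<month>\d{2})(?P<day>\d{2}) "
    r"(?P<time>\d{2}:\d{2}:\d{2}\.\d{6}) +(?P<tid>\d+) "
    r"(?P<file>[^:\s]+):(?P<line>\d+)\] (?P<msg>.*)$"
)


def parse(line, year=2018):
    """Return (timestamp, level, message), or None if the line is not a log entry."""
    m = GLOG.match(line)
    if m is None:
        return None
    stamp = "{}{}{} {}".format(year, m.group("month"), m.group("day"), m.group("time"))
    return datetime.strptime(stamp, "%Y%m%d %H:%M:%S.%f"), m.group("level"), m.group("msg")


start = parse("I0420 14:35:58.911288 17812 authenticatee.cpp:121] Creating new client SASL connection")
end = parse("W0420 14:36:03.912529 17791 slave.cpp:1457] Authentication timed out")
elapsed = (end[0] - start[0]).total_seconds()  # seconds between the two entries
```

For the two entries quoted here the gap is about five seconds. Lines such as the gtest failure message do not match the pattern, and `parse` returns `None` for them.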
[jira] [Commented] (MESOS-6417) Introduce an extra 'unknown' health check state.
[ https://issues.apache.org/jira/browse/MESOS-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16446407#comment-16446407 ] Avinash Sridharan commented on MESOS-6417: -- [~alexr] Are there any plans to introduce this state into Mesos? It is pretty important for entities that are pegging against the Mesos state and trying to ascertain whether a task is unhealthy or simply has no health check defined. DC/OS minuteman is one consumer of this feature. > Introduce an extra 'unknown' health check state. > > > Key: MESOS-6417 > URL: https://issues.apache.org/jira/browse/MESOS-6417 > Project: Mesos > Issue Type: Improvement > Reporter: Alexander Rukletsov > Assignee: Alexander Rukletsov > Priority: Major > Labels: health-check, mesosphere > > There are three logical states regarding health checks: > 1) no health checks; > 2) a health check is defined, but no result is available yet; > 3) a health check is defined, it is either healthy or not. > Currently, we do not distinguish between 1) and 2), which can be problematic > for framework authors.
[jira] [Assigned] (MESOS-6417) Introduce an extra 'unknown' health check state.
[ https://issues.apache.org/jira/browse/MESOS-6417?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Avinash Sridharan reassigned MESOS-6417: Assignee: Alexander Rukletsov > Introduce an extra 'unknown' health check state. > > > Key: MESOS-6417 > URL: https://issues.apache.org/jira/browse/MESOS-6417 > Project: Mesos > Issue Type: Improvement >Reporter: Alexander Rukletsov >Assignee: Alexander Rukletsov >Priority: Major > Labels: health-check, mesosphere > > There are three logical states regarding health checks: > 1) no health checks; > 2) a health check is defined, but no result is available yet; > 3) a health check is defined, it is either healthy or not. > Currently, we do not distinguish between 1) and 2), which can be problematic > for framework authors.
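The three logical states listed in MESOS-6417 can be sketched as follows in Python. This only illustrates the distinction the ticket asks for. `HealthState` and `classify` are hypothetical names, not Mesos types, and state 3) is split into its two outcomes.

```python
from enum import Enum
from typing import Optional


class HealthState(Enum):
    NOT_DEFINED = "no health check defined"          # state 1)
    UNKNOWN = "health check defined, no result yet"  # state 2)
    HEALTHY = "healthy"                              # state 3), positive result
    UNHEALTHY = "unhealthy"                          # state 3), negative result


def classify(has_health_check: bool, healthy: Optional[bool]) -> HealthState:
    """Map (is a check defined?, latest result or None) to one explicit state."""
    if not has_health_check:
        return HealthState.NOT_DEFINED
    if healthy is None:
        return HealthState.UNKNOWN
    return HealthState.HEALTHY if healthy else HealthState.UNHEALTHY
```

A consumer that only sees an absent `healthy` value cannot tell `NOT_DEFINED` from `UNKNOWN`, which is the ambiguity described in the ticket.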
[jira] [Comment Edited] (MESOS-8275) Remove use of ::_stat on Windows
[ https://issues.apache.org/jira/browse/MESOS-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441538#comment-16441538 ] Andrew Schwartzmeyer edited comment on MESOS-8275 at 4/20/18 6:49 PM: -- The {{dev}}, {{inode}}, and {{mode}} functions can be {{delete}} d for Windows; but {{mtime}} will need to be rewritten with e.g. {{GetFileTime}}. was (Author: andschwa): The {{dev}}, {{inode}}, and {{mode}} functions can be {{deleted}} d for Windows; but {{mtime}} will need to be rewritten with e.g. {{GetFileTime}}. > Remove use of ::_stat on Windows > > > Key: MESOS-8275 > URL: https://issues.apache.org/jira/browse/MESOS-8275 > Project: Mesos > Issue Type: Task > Environment: Windows >Reporter: Andrew Schwartzmeyer >Assignee: Andrew Schwartzmeyer >Priority: Major > Labels: stout, windows > > The Windows stat.hpp header has some remaining uses of non-long-path-aware > CRT APIs, specifically {{::_stat}}. This has been punted so far as not yet a > problem, but eventually should be fixed.
[jira] [Comment Edited] (MESOS-8275) Remove use of ::_stat on Windows
[ https://issues.apache.org/jira/browse/MESOS-8275?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16441538#comment-16441538 ] Andrew Schwartzmeyer edited comment on MESOS-8275 at 4/20/18 6:49 PM: -- The {{dev}}, {{inode}}, and {{mode}} functions can be {{deleted}} d for Windows; but {{mtime}} will need to be rewritten with e.g. {{GetFileTime}}. was (Author: andschwa): The {{dev}}, {{inode}}, and {{mode}} functions can be {{deleted}}d for Windows; but {{mtime}} will need to be rewritten with e.g. {{GetFileTime}}. > Remove use of ::_stat on Windows > > > Key: MESOS-8275 > URL: https://issues.apache.org/jira/browse/MESOS-8275 > Project: Mesos > Issue Type: Task > Environment: Windows >Reporter: Andrew Schwartzmeyer >Assignee: Andrew Schwartzmeyer >Priority: Major > Labels: stout, windows > > The Windows stat.hpp header has some remaining uses of non-long-path-aware > CRT APIs, specifically {{::_stat}}. This has been punted so far as not yet a > problem, but eventually should be fixed.
[jira] [Comment Edited] (MESOS-8762) Framework Teardown Leaves Task in Uninterruptible Sleep State D
[ https://issues.apache.org/jira/browse/MESOS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445755#comment-16445755 ] Karsten edited comment on MESOS-8762 at 4/20/18 1:59 PM: - Take a look at this build https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/marathon-sandbox/job/marathon-loop-karsten-slim/142/. First check the console output. You'll see two processes in state D: {code} root 12440 0.2 0.0 45380 13488 ? D 02:48 0:00 python src/app_mock.py 33801 resident-pod-16322 2018-04-20T02:48:35.397Z http://www.example.com root 12601 0.5 0.0 45380 13352 ? D 02:48 0:00 python src/app_mock.py 33793 resident-pod-16322-fail 2018-04-20T02:48:37.678Z http://www.example.com {code} Now, download the logs and search for {{Process tree before teardown}}. You'll find all processes as they were before we triggered the framework teardown on Mesos. You'll find PIDs {{12440}} and {{12601}} as children of {{systemd 1}}, *not* as children of {{mesos-agent,11783}} as I would expect. The same is true for the tree from {{ps auxf}} in the console logs. These were captured *after* all tests ran. was (Author: jeschkies): Take a look at this build https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/marathon-sandbox/job/marathon-loop-karsten-slim/142/. First check the console output. You'll see two processes in state D {code} root 12440 0.2 0.0 45380 13488 ? D 02:48 0:00 python src/app_mock.py 33801 resident-pod-16322 2018-04-20T02:48:35.397Z http://www.example.com root 12601 0.5 0.0 45380 13352 ? D 02:48 0:00 python src/app_mock.py 33793 resident-pod-16322-fail 2018-04-20T02:48:37.678Z http://www.example.com``` {code} Now, download the logs and search for {Process tree before teardown}. You'll find all process before we trigger the framework treardown on Mesos. You'll find PIDs {12440} and {12440} as children of {systemd 1} *not* as children of {mesos-agent,11783} as I would expect. The same is true for the tree from {ps auxf} in the console logs again. 
These are *after* all tests ran. > Framework Teardown Leaves Task in Uninterruptible Sleep State D > --- > > Key: MESOS-8762 > URL: https://issues.apache.org/jira/browse/MESOS-8762 > Project: Mesos > Issue Type: Bug > Reporter: Karsten > Assignee: Till Toenshoff > Priority: Major > Attachments: UpgradeIntegrationTest.zip, happy-process-sandbox.zip, > happy-process.trace, master_agent.log, zombi-process.trace, > zombie-process-sandbox.zip > > > Marathon has a test suite that starts a simple Python HTTP server in a > task group (a.k.a. pod) in Marathon with a persistent volume. After the test run > we call {{/teardown}} and wait for the Marathon framework to be completed > (see > [MesosTest|https://github.com/mesosphere/marathon/blob/master/src/test/scala/mesosphere/marathon/integration/setup/MesosTest.scala#L311]). > > Our CI checks whether we leak any tasks after all test runs. It turns out we > do: > {code} > Will kill: > root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python > src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z > http://www.example.com > Running 'sudo kill -9 18084 > Wait for processes being killed... > ... > Couldn't kill some leaked processes: > root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python > src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z > http://www.example.com > ammonite.$file.ci.utils$StageException: Stage Compile and Test failed. 
> {code} > The attached Mesos master and agents logs (see attachment) show > {code} > Updating the state of task > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a- (latest state: > TASK_KILLED, status update state: TASK_KILLED) > {code} > The executor logs (see zipped sandbox attached) shows > {code} > I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event > I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down > I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > running in child container > cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae > with SIGTERM signal > I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation > to SIGKILL in 3secs from now > {code} > The task logs > {code} > 2018-04-06 07:52:36,620 INFO: 2.7.13 > 2018-04-06 07:52:36,621 DEBUG : ['src/app_mock.py', '35477', > 'resident-pod-16322-fail', '2018-04-06T07:52:16.924Z', > 'http://www.example.com'] > 2018-04-06 07:52:36,621 INFO:
[jira] [Commented] (MESOS-8762) Framework Teardown Leaves Task in Uninterruptible Sleep State D
[ https://issues.apache.org/jira/browse/MESOS-8762?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16445755#comment-16445755 ] Karsten commented on MESOS-8762: Take a look at this build https://jenkins.mesosphere.com/service/jenkins/view/Marathon/job/marathon-sandbox/job/marathon-loop-karsten-slim/142/. First check the console output. You'll see two processes in state D: {code} root 12440 0.2 0.0 45380 13488 ? D 02:48 0:00 python src/app_mock.py 33801 resident-pod-16322 2018-04-20T02:48:35.397Z http://www.example.com root 12601 0.5 0.0 45380 13352 ? D 02:48 0:00 python src/app_mock.py 33793 resident-pod-16322-fail 2018-04-20T02:48:37.678Z http://www.example.com {code} Now, download the logs and search for {{Process tree before teardown}}. You'll find all processes as they were before we triggered the framework teardown on Mesos. You'll find PIDs {{12440}} and {{12601}} as children of {{systemd 1}}, *not* as children of {{mesos-agent,11783}} as I would expect. The same is true for the tree from {{ps auxf}} in the console logs. These were captured *after* all tests ran. > Framework Teardown Leaves Task in Uninterruptible Sleep State D > --- > > Key: MESOS-8762 > URL: https://issues.apache.org/jira/browse/MESOS-8762 > Project: Mesos > Issue Type: Bug > Reporter: Karsten > Assignee: Till Toenshoff > Priority: Major > Attachments: UpgradeIntegrationTest.zip, happy-process-sandbox.zip, > happy-process.trace, master_agent.log, zombi-process.trace, > zombie-process-sandbox.zip > > > Marathon has a test suite that starts a simple Python HTTP server in a > task group (a.k.a. pod) in Marathon with a persistent volume. After the test run > we call {{/teardown}} and wait for the Marathon framework to be completed > (see > [MesosTest|https://github.com/mesosphere/marathon/blob/master/src/test/scala/mesosphere/marathon/integration/setup/MesosTest.scala#L311]). > > Our CI checks whether we leak any tasks after all test runs. 
It turns out we > do: > {code} > Will kill: > root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python > src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z > http://www.example.com > Running 'sudo kill -9 18084 > Wait for processes being killed... > ... > Couldn't kill some leaked processes: > root 18084 0.0 0.0 45380 13612 ? D 07:52 0:00 python > src/app_mock.py 35477 resident-pod-16322-fail 2018-04-06T07:52:16.924Z > http://www.example.com > ammonite.$file.ci.utils$StageException: Stage Compile and Test failed. > {code} > The attached Mesos master and agent logs (see attachment) show > {code} > Updating the state of task > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > of framework 3c1bf149-6b68-469c-beb9-9910f386fd5a- (latest state: > TASK_KILLED, status update state: TASK_KILLED) > {code} > The executor logs (see zipped sandbox attached) show > {code} > I0406 07:52:39.925599 18078 default_executor.cpp:191] Received SHUTDOWN event > I0406 07:52:39.925624 18078 default_executor.cpp:962] Shutting down > I0406 07:52:39.925634 18078 default_executor.cpp:1058] Killing task > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > running in child container > cfce1f46-f565-4f45-aa0f-e6e6f6c5434b.0cc9c336-f0b4-448d-a8b4-dea33c02f0ae > with SIGTERM signal > I0406 07:52:39.925647 18078 default_executor.cpp:1080] Scheduling escalation > to SIGKILL in 3secs from now > {code} > The task logs > {code} > 2018-04-06 07:52:36,620 INFO: 2.7.13 > 2018-04-06 07:52:36,621 DEBUG : ['src/app_mock.py', '35477', > 'resident-pod-16322-fail', '2018-04-06T07:52:16.924Z', > 'http://www.example.com'] > 2018-04-06 07:52:36,621 INFO: AppMock[resident-pod-16322-fail > 2018-04-06T07:52:16.924Z]: > resident-pod-16322-fail.instance-6d2d04ba-396f-11e8-b2e3-02425ff42cc9.task1.2 > has taken the stage at port 35477. Will query http://www.example.com for > health and readiness status. 
> 2018-04-06 07:52:38,895 DEBUG : Got GET request > 172.16.10.198 - - [06/Apr/2018 07:52:38] "GET /pst1/foo HTTP/1.1" 200 - > {code} > Compare these to a single task without a persistent volume of the same run > {code} > I0406 07:52:39.925590 15585 exec.cpp:445] Executor asked to shutdown > I0406 07:52:39.925698 15585 executor.cpp:171] Received SHUTDOWN event > I0406 07:52:39.925717 15585 executor.cpp:748] Shutting down > I0406 07:52:39.925729 15585 executor.cpp:863] Sending SIGTERM to process tree > at pid 15591 > I0406 07:52:40.029878 15585 executor.cpp:876] Sent SIGTERM to the following > process trees: > [ > -+- 15591 sh -c echo APP PROXY $MESOS_TASK_ID RUNNING; > /home/admin/workspace/marathon-sandbox/marathon-loop-karsten/src/test/python/app_mock.py >
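A leak check like the one described in MESOS-8762 can flag processes stuck in uninterruptible sleep by looking at the STAT column of `ps aux` output. The Python sketch below is a minimal illustration and is not Marathon's CI script. `uninterruptible` is a hypothetical helper, and the `mesos-agent` row in the sample uses made-up values purely to show a non-matching line.

```python
def uninterruptible(ps_output):
    """Return (pid, command) for each `ps aux` row whose STAT starts with 'D'."""
    found = []
    for line in ps_output.splitlines():
        # Columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header row or malformed line
        if fields[7].startswith("D"):
            found.append((int(fields[1]), fields[10]))
    return found


SAMPLE = """\
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 12440 0.2 0.0 45380 13488 ? D 02:48 0:00 python src/app_mock.py 33801 resident-pod-16322
root 11783 0.1 0.0 99999 20000 ? Ssl 02:40 0:01 mesos-agent
"""

stuck = uninterruptible(SAMPLE)
```

Processes in state D do not act on signals, including SIGKILL, until the blocking kernel operation returns, which is consistent with the `kill -9` in the CI output above having no effect.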
[jira] [Created] (MESOS-8813) Make multiple tasks with different users can access a shared persistent volume
Qian Zhang created MESOS-8813: - Summary: Make multiple tasks with different users can access a shared persistent volume Key: MESOS-8813 URL: https://issues.apache.org/jira/browse/MESOS-8813 Project: Mesos Issue Type: Task Reporter: Qian Zhang Assignee: Qian Zhang See [design doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.f4x59l41lxwx] for why we need to do this.
[jira] [Created] (MESOS-8814) Mount the volume based on `Volume.mode`
Qian Zhang created MESOS-8814: - Summary: Mount the volume based on `Volume.mode` Key: MESOS-8814 URL: https://issues.apache.org/jira/browse/MESOS-8814 Project: Mesos Issue Type: Task Reporter: Qian Zhang Assignee: Qian Zhang See [design doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.kck8nfvxr80w] for why we need to do this.
[jira] [Created] (MESOS-8812) Grant non-root task user the permissions to access the DOCKER_VOLUME volume
Qian Zhang created MESOS-8812: - Summary: Grant non-root task user the permissions to access the DOCKER_VOLUME volume Key: MESOS-8812 URL: https://issues.apache.org/jira/browse/MESOS-8812 Project: Mesos Issue Type: Task Reporter: Qian Zhang Assignee: Qian Zhang See [design doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.e6p985n775m] for why we need to do this.
[jira] [Created] (MESOS-8811) Grant non-root task user the permissions to access the image volume
Qian Zhang created MESOS-8811: - Summary: Grant non-root task user the permissions to access the image volume Key: MESOS-8811 URL: https://issues.apache.org/jira/browse/MESOS-8811 Project: Mesos Issue Type: Task Reporter: Qian Zhang Assignee: Qian Zhang See [design doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s78760cmtdz6] for why we need to do this.
[jira] [Created] (MESOS-8810) Grant non-root task user the permissions to access the SANDBOX_PATH volume of PARENT type
Qian Zhang created MESOS-8810: - Summary: Grant non-root task user the permissions to access the SANDBOX_PATH volume of PARENT type Key: MESOS-8810 URL: https://issues.apache.org/jira/browse/MESOS-8810 Project: Mesos Issue Type: Task Reporter: Qian Zhang Assignee: Qian Zhang See [design doc|https://docs.google.com/document/d/1QyeDDX4Zr9E-0jKMoPTzsGE-v4KWwjmnCR0l8V4Tq2U/edit#heading=h.s6f8rmu65g2p] for why we need to do this.
[jira] [Created] (MESOS-8809) Add functions for manipulating POSIX ACLs into stout
Qian Zhang created MESOS-8809: - Summary: Add functions for manipulating POSIX ACLs into stout Key: MESOS-8809 URL: https://issues.apache.org/jira/browse/MESOS-8809 Project: Mesos Issue Type: Task Components: stout Reporter: Qian Zhang Assignee: Qian Zhang We need to add functions for setting and getting POSIX ACLs to stout so that we can use them to grant volume permissions to a specific task user. This will introduce a new build dependency, {{libacl-devel}}, for Mesos.