[jira] [Created] (MESOS-8622) Agent should send a task status update upon receiving the task
Yan Xu created MESOS-8622:

Summary: Agent should send a task status update upon receiving the task
Key: MESOS-8622
URL: https://issues.apache.org/jira/browse/MESOS-8622
Project: Mesos
Issue Type: Improvement
Reporter: Yan Xu

Currently, before the first status update of a successful task launch is sent, the agent goes through steps such as filesystem image provisioning and artifact fetching, whose duration depends heavily on the task rather than on the performance of "the infrastructure" (i.e., the Mesos stack, host load, or other problems).

Ideally the scheduler would be able to set a timeout on this delay that excludes the time spent on FS provisioning and artifact fetching, so it can relaunch the task somewhere else instead of waiting indefinitely.

{{TASK_STARTING}} wouldn't work for this purpose because it's sent only after the executor is registered.

We can actually just have the agent send {{TASK_STAGING}}. Its {{TaskStatus.source = SOURCE_SLAVE}} and {{TaskStatus.reason = null}} can help the scheduler distinguish it from updates sent as a result of reconciliation. Creating a new state for this feels unnecessary?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)
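The scheduler-side use of such an update could look roughly like the sketch below. This is a hypothetical illustration in Python, not Mesos API code: `TaskStatus` is a plain dataclass standing in for the protobuf message, and `is_agent_receipt` and `LaunchTimer` are invented names. It assumes, as the description proposes, that the agent-sent {{TASK_STAGING}} carries {{SOURCE_SLAVE}} and no reason, while reconciliation updates carry some reason.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TaskStatus:
    """Stand-in for the Mesos TaskStatus message (illustrative only)."""
    state: str
    source: str
    reason: Optional[str] = None


def is_agent_receipt(status: TaskStatus) -> bool:
    """True if this looks like the proposed 'agent received the task' update."""
    return (
        status.state == "TASK_STAGING"
        and status.source == "SOURCE_SLAVE"
        and status.reason is None
    )


class LaunchTimer:
    """Tracks how long a task waits before the agent acknowledges receipt."""

    def __init__(self, launched_at: float, timeout_secs: float) -> None:
        self.launched_at = launched_at
        self.timeout_secs = timeout_secs
        self.received = False

    def on_update(self, status: TaskStatus) -> None:
        # Reconciliation updates do not match and therefore do not stop the timer.
        if is_agent_receipt(status):
            self.received = True

    def should_relaunch(self, now: float) -> bool:
        # Only the delay before the agent receives the task counts;
        # provisioning and fetching happen after receipt and are excluded.
        return not self.received and now - self.launched_at > self.timeout_secs
```

With this shape, a slow image download never triggers a relaunch, because the timer stops as soon as the agent reports that it has the task.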
[jira] [Commented] (MESOS-8576) Improve discard handling of 'Docker::inspect()'
[ https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379603#comment-16379603 ]

Greg Mann commented on MESOS-8576:

{code}
commit a4805eff5dbef7146f34b239b834a688954991f9
Author: Greg Mann g...@mesosphere.io
Date: Tue Feb 27 16:42:10 2018 -0800

    Ensured that Docker containerizer returns a failed Future in one case.

    Because the Docker library did not immediately transition the Future
    returned by `inspect()` to DISCARDED, it was safe for the Docker
    containerizer to discard this Future before failing the Promise
    associated with the Future returned by `launch()`. However, the
    introduction of an `onDiscard` callback in `Docker::inspect()` makes
    this assumption invalid. This patch addresses this by failing the
    Promise before discarding the Future.

    Review: https://reviews.apache.org/r/65743/
{code}

> Improve discard handling of 'Docker::inspect()'
> ---
>
> Key: MESOS-8576
> URL: https://issues.apache.org/jira/browse/MESOS-8576
> Project: Mesos
> Issue Type: Improvement
> Components: containerization, docker
> Affects Versions: 1.5.0
> Reporter: Greg Mann
> Assignee: Greg Mann
> Priority: Major
> Labels: mesosphere
>
> In the call path of {{Docker::inspect()}}, each continuation currently checks
> if {{promise->future().hasDiscard()}}, where the {{promise}} is associated
> with the output of the {{docker inspect}} call. However, if the call to
> {{docker inspect}} becomes hung indefinitely, then continuations are never
> invoked, and a subsequent discard of the returned {{Future}} will have no
> effect. We should add proper {{onDiscard}} handling to that {{Future}} so
> that appropriate cleanup is performed in such cases.
[jira] [Comment Edited] (MESOS-8576) Improve discard handling of 'Docker::inspect()'
[ https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375252#comment-16375252 ] Greg Mann edited comment on MESOS-8576 at 2/28/18 1:05 AM: --- Still working on this one. The problem is that {{Docker::inspect()}} has retry logic embedded within the library function, since we often call it before a container has started running in order to detect that the container is up. So, to avoid repeatedly registering {{onDiscard}} callbacks with every retry (which would constitute a memory leak), we need to pass the "context" of the current {{docker inspect}} call through the async call chain, and also make it accessible to the {{onDiscard}} callback which we install onto the returned future. Since the Docker library is not currently a libprocess actor, this is a bit difficult. was (Author: greggomann): Still working on this one. The problem is that {{Docker::inspect()}} has retry logic embedded within the library function, since we often call it before a container has started running in order to detect that the container is up. So, to avoid repeatedly registering {{onDiscard}} callbacks with every retry (which would constitute a memory leak), we need to pass the "context" of the current {{docker inspect}} call through the async call chain, and also make it accessible to the {{onDiscard}} callback which we install onto the returned future. Since the Docker library is not currently a libprocess actor, this is a bit difficult. 
WIP patch here: https://reviews.apache.org/r/65683/
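The "register once, share a context across retries" idea from the comment can be illustrated with a toy model. The Python sketch below is not libprocess and not the actual patch: `Future`, `InspectContext`, `start_inspect`, and `retry` are hypothetical names, and the model is synchronous for simplicity. It only shows why a single discard callback that reads a shared context avoids piling up one callback per retry.

```python
from typing import Callable, List, Tuple


class Future:
    """Toy future that only models discard callbacks (illustrative only)."""

    def __init__(self) -> None:
        self.discard_callbacks: List[Callable[[], None]] = []

    def on_discard(self, callback: Callable[[], None]) -> None:
        self.discard_callbacks.append(callback)

    def discard(self) -> None:
        for callback in self.discard_callbacks:
            callback()


class InspectContext:
    """State shared by every retry of one logical inspect call."""

    def __init__(self) -> None:
        self.attempt = 0
        self.cleaned_up: List[int] = []  # attempts whose subprocess was killed

    def kill_current(self) -> None:
        # Clean up whichever attempt is in flight at discard time.
        self.cleaned_up.append(self.attempt)


def start_inspect() -> Tuple[Future, InspectContext]:
    future = Future()
    context = InspectContext()
    # Registered exactly once, when the future is handed to the caller.
    future.on_discard(context.kill_current)
    return future, context


def retry(context: InspectContext) -> None:
    # A retry only updates the shared context; it installs no new callback.
    context.attempt += 1
```

After any number of retries the future still holds one callback, and a discard cleans up only the attempt that is currently running.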
[jira] [Commented] (MESOS-8576) Improve discard handling of 'Docker::inspect()'
[ https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379604#comment-16379604 ]

Greg Mann commented on MESOS-8576:

Review: https://reviews.apache.org/r/65683/
[jira] [Created] (MESOS-8621) Adding a `/debug` master endpoint to gather debug info, e.g., filters.
Chun-Hung Hsiao created MESOS-8621:

Summary: Adding a `/debug` master endpoint to gather debug info, e.g., filters.
Key: MESOS-8621
URL: https://issues.apache.org/jira/browse/MESOS-8621
Project: Mesos
Issue Type: Improvement
Components: master
Reporter: Chun-Hung Hsiao

Currently it is hard to debug issues where a framework is not able to get offers. We could add a {{/debug}} endpoint that makes debugging easier, starting with information about active offer filters. Since this endpoint is for debugging purposes, there might be no guarantee of backward compatibility in the future.
[jira] [Created] (MESOS-8620) Fetcher failures lead to sibling nested containers stuck in FETCHING
Chun-Hung Hsiao created MESOS-8620:

Summary: Fetcher failures lead to sibling nested containers stuck in FETCHING
Key: MESOS-8620
URL: https://issues.apache.org/jira/browse/MESOS-8620
Project: Mesos
Issue Type: Bug
Affects Versions: 1.5.0, 1.4.0
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao

Two nested containers were launched and transitioned to FETCHING at nearly the same time, and tried to fetch the same artifacts. The first one failed to fetch some artifacts and transitioned to DESTROYING. However, the second nested container got stuck in FETCHING and the LAUNCH_NESTED_CONTAINER call never returned.

{noformat}
I0226 06:27:15.00 9494 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER call for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4'
...
I0226 06:27:15.00 9499 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER call for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb'
...
I0226 06:27:15.00 9493 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from ISOLATING to FETCHING
I0226 06:27:15.00 9500 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb from ISOLATING to FETCHING
...
E0226 06:29:45.00 9496 fetcher.cpp:568] Failed to run mesos-fetcher: Failed to fetch all URIs for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': exited with status 1
W0226 06:29:45.00 9497 http.cpp:2758] Failed to launch container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4: Failed to fetch all URIs for container 'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': exited with status 1
I0226 06:29:45.00 9497 containerizer.cpp:2354] Destroying container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 in FETCHING state
I0226 06:29:45.00 9497 containerizer.cpp:2968] Transitioning the state of container fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from FETCHING to DESTROYING
{noformat}
[jira] [Created] (MESOS-8619) Docker on Windows uses USERPROFILE instead of HOME for credentials
Andrew Schwartzmeyer created MESOS-8619:

Summary: Docker on Windows uses USERPROFILE instead of HOME for credentials
Key: MESOS-8619
URL: https://issues.apache.org/jira/browse/MESOS-8619
Project: Mesos
Issue Type: Improvement
Environment: Windows 10 with Docker version 17.12.0-ce, build c97c6d6.
Reporter: Andrew Schwartzmeyer
Assignee: Andrew Schwartzmeyer

The logic for doing a {{docker pull}} of an image from a private registry assumes that {{.docker/config.json}} is to be found in {{$HOME}} (according to the [Mesosphere instructions|https://mesosphere.github.io/marathon/docs/native-docker-private-registry.html#docker-containerizer] and the [code|https://github.com/apache/mesos/blob/b7933c176d719766bdb6459048ede6e94f6a7763/src/docker/docker.cpp#L1710]).

However, this assumption only holds on Linux, per the [Docker code|https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/homedir/homedir_unix.go#L14]. On Windows, Docker explicitly looks at the {{USERPROFILE}} environment variable, again [per the Docker code|https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/homedir/homedir_windows.go#L10].

So in order for Docker to pick up the config file correctly, we need to change the variable used on Windows in the Docker containerizer.
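The change amounts to choosing the environment variable by platform before invoking {{docker pull}}. The Python sketch below only illustrates that selection; the actual fix would live in the C++ Docker containerizer, and `credentials_env` and its parameters are hypothetical names introduced here.

```python
import sys
from typing import Dict


def credentials_env(config_dir: str, platform: str = sys.platform) -> Dict[str, str]:
    """Return the environment entry that points Docker at `config_dir`.

    `config_dir` is assumed to be the directory containing `.docker/config.json`.
    Docker resolves its home directory from USERPROFILE on Windows and from
    HOME elsewhere, so the variable name has to follow the platform.
    """
    variable = "USERPROFILE" if platform.startswith("win") else "HOME"
    return {variable: config_dir}
```

A caller would merge the returned entry into the environment of the {{docker pull}} subprocess instead of always setting {{HOME}}.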
[jira] [Created] (MESOS-8618) ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.
Alexander Rukletsov created MESOS-8618:

Summary: ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.
Key: MESOS-8618
URL: https://issues.apache.org/jira/browse/MESOS-8618
Project: Mesos
Issue Type: Bug
Components: test
Environment: ec Debian 9 with SSL
Reporter: Alexander Rukletsov
Attachments: ReconciliationTest.ReconcileStatusUpdateTaskState-badrun.txt

{noformat}
../../src/tests/reconciliation_tests.cpp:1129
Expected: TASK_RUNNING
To be equal to: update->state()
Which is: TASK_FINISHED
{noformat}

Full log attached.
[jira] [Commented] (MESOS-8617) Tests using default executor occasionally fail.
[ https://issues.apache.org/jira/browse/MESOS-8617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378765#comment-16378765 ]

Alexander Rukletsov commented on MESOS-8617:

From a brief look over the logs, the default executor either has not started at all or crashed before writing anything into the log.

> Tests using default executor occasionally fail.
> ---
>
> Key: MESOS-8617
> URL: https://issues.apache.org/jira/browse/MESOS-8617
> Project: Mesos
> Issue Type: Bug
> Reporter: Alexander Rukletsov
> Priority: Major
> Labels: flaky-test
> Attachments: MasterTest.TasksEndpoint-badrun.txt, MasterTest.TasksEndpoint-goodrun.txt
>
> Task transition expectation can be violated resulting in a failing test, e.g.:
> {noformat}
> ../../src/tests/master_tests.cpp:4134: Failure
> Expected: TASK_RUNNING
> To be equal to: status1->state()
> Which is: TASK_LOST
> {noformat}
> List of known affected tests:
> {noformat}
> MasterTest.TasksEndpoint
> {noformat}
[jira] [Created] (MESOS-8617) Tests using default executor occasionally fail.
Alexander Rukletsov created MESOS-8617:

Summary: Tests using default executor occasionally fail.
Key: MESOS-8617
URL: https://issues.apache.org/jira/browse/MESOS-8617
Project: Mesos
Issue Type: Bug
Reporter: Alexander Rukletsov
Attachments: MasterTest.TasksEndpoint-badrun.txt

Task transition expectation can be violated resulting in a failing test, e.g.:
{noformat}
../../src/tests/master_tests.cpp:4134: Failure
Expected: TASK_RUNNING
To be equal to: status1->state()
Which is: TASK_LOST
{noformat}
List of known affected tests:
{noformat}
MasterTest.TasksEndpoint
{noformat}
[jira] [Commented] (MESOS-8521) Various IOSwitchboard related tests fail on macOS High Sierra.
[ https://issues.apache.org/jira/browse/MESOS-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378719#comment-16378719 ]

Alexander Rukletsov commented on MESOS-8521:

I've just successfully run a complete {{ninja check}} on macOS High Sierra 10.13.3 (17D47) with {{Apple LLVM version 9.0.0 (clang-900.0.39.2)}}:
{noformat}
[==] Running 6 tests from 2 test cases.
[--] Global test environment set-up.
[--] 4 tests from IOSwitchboardTest
[ RUN ] IOSwitchboardTest.ContainerAttach
[ OK ] IOSwitchboardTest.ContainerAttach (179 ms)
[ RUN ] IOSwitchboardTest.OutputRedirectionWithTTY
[ OK ] IOSwitchboardTest.OutputRedirectionWithTTY (124 ms)
[ RUN ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
[ OK ] IOSwitchboardTest.KillSwitchboardContainerDestroyed (283 ms)
[ RUN ] IOSwitchboardTest.ContainerAttachAfterSlaveRestart
[ OK ] IOSwitchboardTest.ContainerAttachAfterSlaveRestart (413 ms)
[--] 4 tests from IOSwitchboardTest (1006 ms total)
[--] 2 tests from ContentType/AgentAPITest
[ RUN ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
[ OK ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0 (404 ms)
[ RUN ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
[ OK ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1 (409 ms)
[--] 2 tests from ContentType/AgentAPITest (819 ms total)
[--] Global test environment tear-down
[==] 6 tests from 2 test cases ran. (1847 ms total)
[ PASSED ] 6 tests.
{noformat}

> Various IOSwitchboard related tests fail on macOS High Sierra.
> ---
>
> Key: MESOS-8521
> URL: https://issues.apache.org/jira/browse/MESOS-8521
> Project: Mesos
> Issue Type: Bug
> Environment: macOS 10.13.2 (17C88)
> Apple LLVM version 9.0.0 (clang-900.0.39.2)
> Reporter: Till Toenshoff
> Priority: Major
>
> The problem appears to cause several switchboard tests to fail. Note that
> this problem does not manifest on older Apple systems.
> The failure rate on this system is 100%.
> List of currently failing tests:
> {noformat}
> IOSwitchboardTest.ContainerAttach
> IOSwitchboardTest.ContainerAttachAfterSlaveRestart
> IOSwitchboardTest.OutputRedirectionWithTTY
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
> {noformat}
> This is an example using {{GLOG=v1}} verbose logging:
> {noformat}
> [ RUN ] IOSwitchboardTest.ContainerAttach
> I0201 03:02:51.925930 2385417024 containerizer.cpp:304] Using isolation {
> environment_secret, filesystem/posix, posix/cpu }
> I0201 03:02:51.926230 2385417024 provisioner.cpp:299] Using default backend
> 'copy'
> I0201 03:02:51.927325 107409408 containerizer.cpp:674] Recovering
> containerizer
> I0201 03:02:51.928336 109019136 provisioner.cpp:495] Provisioner recovery
> complete
> I0201 03:02:51.934250 105799680 containerizer.cpp:1202] Starting container
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.936218 105799680 containerizer.cpp:1368] Checkpointed
> ContainerConfig at
> '/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad/config'
> I0201 03:02:51.936251 105799680 containerizer.cpp:2952] Transitioning the
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PROVISIONING to
> PREPARING
> I0201 03:02:51.937369 109019136 switchboard.cpp:429] Allocated pseudo
> terminal '/dev/ttys003' for container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.943632 109019136 switchboard.cpp:557] Launching
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs"
> --help="false"
> --socket_address="/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8"
> --stderr_from_fd="7" --stderr_to_fd="2" --stdin_to_fd="7"
> --stdout_from_fd="7" --stdout_to_fd="1" --tty="true"
> --wait_for_connection="false"' for container
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.945106 109019136 switchboard.cpp:587] Created I/O switchboard
> server (pid: 83716) listening on socket file
> '/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8' for
> container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.947762 106336256 containerizer.cpp:1844] Launching
> 'mesos-containerizer' with flags '--help="false"
> --launch_info="{"command":{"shell":true,"value":"sleep >