[jira] [Created] (MESOS-8622) Agent should send a task status update when upon receiving the task

2018-02-27 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8622:
-

 Summary: Agent should send a task status update when upon 
receiving the task
 Key: MESOS-8622
 URL: https://issues.apache.org/jira/browse/MESOS-8622
 Project: Mesos
  Issue Type: Improvement
Reporter: Yan Xu


Currently, before the first status update of a successful task launch is sent, 
the steps include filesystem image provisioning and artifact fetching, whose 
duration depends heavily on the task rather than on the performance of "the 
infrastructure" (i.e., the Mesos stack, host load, or other problems).

Ideally the scheduler would be able to set a timeout on this delay, excluding 
the time spent on FS provisioning and artifact fetching, so it can relaunch the 
task somewhere else instead of waiting indefinitely.

{{TASK_STARTING}} wouldn't work for this purpose because it's sent only after 
the executor is registered.

We can actually just have the agent send {{TASK_STAGING}}. Its 
{{TaskStatus.source = SOURCE_SLAVE}} and {{TaskStatus.reason = null}} can help 
the scheduler distinguish it from updates sent as a result of reconciliation. 
Creating a new state for this feels unnecessary?
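
A minimal scheduler-side sketch of how this could be used, assuming status 
updates are available as plain dictionaries whose keys mirror the 
{{TaskStatus}} fields. The helper names {{is_agent_receipt}} and 
{{should_relaunch}} are hypothetical and not part of any Mesos API.

```python
# Hypothetical helpers; the dict keys mirror TaskStatus fields.

def is_agent_receipt(status):
    """True if the update looks like the proposed 'task received' signal:
    TASK_STAGING sent by the agent with no reason set."""
    return (
        status.get("state") == "TASK_STAGING"
        and status.get("source") == "SOURCE_SLAVE"
        and status.get("reason") is None
    )


def should_relaunch(launched_at, now, receipt_seen, timeout_secs):
    """Relaunch only if the agent never acknowledged the task in time.
    Once the receipt is seen, provisioning and fetching time no longer
    counts against this timeout."""
    return not receipt_seen and (now - launched_at) > timeout_secs
```

A reconciliation update carries a reason (e.g. {{REASON_RECONCILIATION}}), so 
it would not be mistaken for the receipt under this check.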



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8576) Improve discard handling of 'Docker::inspect()'

2018-02-27 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379603#comment-16379603
 ] 

Greg Mann commented on MESOS-8576:
--

{code}
commit a4805eff5dbef7146f34b239b834a688954991f9
Author: Greg Mann g...@mesosphere.io
Date:   Tue Feb 27 16:42:10 2018 -0800


Ensured that Docker containerizer returns a failed Future in one case.

Because the Docker library did not immediately transition the
Future returned by `inspect()` to DISCARDED, it was safe for the
Docker containerizer to discard this Future before failing the
Promise associated with the Future returned by `launch()`.

However, the introduction of an `onDiscard` callback in
`Docker::inspect()` makes this assumption invalid. This patch
addresses this by failing the Promise before discarding the
Future.

Review: https://reviews.apache.org/r/65743/
{code}

> Improve discard handling of 'Docker::inspect()'
> ---
>
> Key: MESOS-8576
> URL: https://issues.apache.org/jira/browse/MESOS-8576
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: mesosphere
>
> In the call path of {{Docker::inspect()}}, each continuation currently checks 
> if {{promise->future().hasDiscard()}}, where the {{promise}} is associated 
> with the output of the {{docker inspect}} call. However, if the call to 
> {{docker inspect}} becomes hung indefinitely, then continuations are never 
> invoked, and a subsequent discard of the returned {{Future}} will have no 
> effect. We should add proper {{onDiscard}} handling to that {{Future}} so 
> that appropriate cleanup is performed in such cases.





[jira] [Comment Edited] (MESOS-8576) Improve discard handling of 'Docker::inspect()'

2018-02-27 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16375252#comment-16375252
 ] 

Greg Mann edited comment on MESOS-8576 at 2/28/18 1:05 AM:
---

Still working on this one. The problem is that {{Docker::inspect()}} has retry 
logic embedded within the library function, since we often call it before a 
container has started running in order to detect that the container is up. So, 
to avoid repeatedly registering {{onDiscard}} callbacks with every retry (which 
would constitute a memory leak), we need to pass the "context" of the current 
{{docker inspect}} call through the async call chain, and also make it 
accessible to the {{onDiscard}} callback which we install onto the returned 
future. Since the Docker library is not currently a libprocess actor, this is a 
bit difficult.
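
To illustrate the idea of a shared "context", here is a rough sketch in Python 
rather than libprocess C++. All names ({{InspectContext}}, {{FakeFuture}}, 
{{InspectCall}}) are hypothetical stand-ins, and the attempts are simulated 
rather than real {{docker inspect}} processes. The point is only that the 
discard callback is registered once and reads the current attempt from the 
context.

```python
class InspectContext:
    """Hypothetical state shared by all retries of one inspect call."""
    def __init__(self):
        self.current_attempt = None  # id of the in-flight attempt, if any
        self.discarded = False


class FakeFuture:
    """Tiny stand-in for a future that supports discard callbacks."""
    def __init__(self):
        self.on_discard_callbacks = []
        self.result = None

    def on_discard(self, callback):
        self.on_discard_callbacks.append(callback)

    def discard(self):
        for callback in self.on_discard_callbacks:
            callback()


class InspectCall:
    def __init__(self, outcomes):
        # Simulated attempt outcomes: "not-running", "running" or "hung".
        self.outcomes = list(outcomes)
        self.context = InspectContext()
        self.future = FakeFuture()
        self.killed = []
        self.attempts = 0
        # Registered exactly once, not once per retry.
        self.future.on_discard(self._cleanup)

    def _cleanup(self):
        self.context.discarded = True
        if self.context.current_attempt is not None:
            self.killed.append(self.context.current_attempt)
            self.context.current_attempt = None

    def step(self):
        """Run one simulated attempt; return True if a retry is needed."""
        if self.context.discarded or not self.outcomes:
            return False
        self.attempts += 1
        outcome = self.outcomes.pop(0)
        self.context.current_attempt = self.attempts
        if outcome == "hung":
            return False  # the attempt stays in flight
        self.context.current_attempt = None
        if outcome == "running":
            self.future.result = "running"
            return False
        return True  # container not up yet, so retry
```

For example, with outcomes {{["not-running", "not-running", "hung"]}}, calling 
{{step()}} until it returns false and then discarding the future cleans up 
only the third attempt, and only one callback is ever registered.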


was (Author: greggomann):
Still working on this one. The problem is that {{Docker::inspect()}} has retry 
logic embedded within the library function, since we often call it before a 
container has started running in order to detect that the container is up. So, 
to avoid repeatedly registering {{onDiscard}} callbacks with every retry (which 
would constitute a memory leak), we need to pass the "context" of the current 
{{docker inspect}} call through the async call chain, and also make it 
accessible to the {{onDiscard}} callback which we install onto the returned 
future. Since the Docker library is not currently a libprocess actor, this is a 
bit difficult.

WIP patch here: https://reviews.apache.org/r/65683/






[jira] [Commented] (MESOS-8576) Improve discard handling of 'Docker::inspect()'

2018-02-27 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16379604#comment-16379604
 ] 

Greg Mann commented on MESOS-8576:
--

Review: https://reviews.apache.org/r/65683/






[jira] [Created] (MESOS-8621) Adding a `/debug` master endpoint to gather debug info, e.g., filters.

2018-02-27 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8621:
--

 Summary: Adding a `/debug` master endpoint to gather debug info, 
e.g., filters.
 Key: MESOS-8621
 URL: https://issues.apache.org/jira/browse/MESOS-8621
 Project: Mesos
  Issue Type: Improvement
  Components: master
Reporter: Chun-Hung Hsiao


Currently it is hard to debug issues related to frameworks not being able to 
get offers. We could add a {{/debug}} endpoint that makes debugging easier, 
starting by adding information about active offer filters. Since this 
endpoint is for debugging purposes, there might be no guarantee of backward 
compatibility in the future.





[jira] [Created] (MESOS-8620) Fetcher failures lead to sibling nested containers stuck in FETCHING

2018-02-27 Thread Chun-Hung Hsiao (JIRA)
Chun-Hung Hsiao created MESOS-8620:
--

 Summary: Fetcher failures lead to sibling nested containers stuck 
in FETCHING
 Key: MESOS-8620
 URL: https://issues.apache.org/jira/browse/MESOS-8620
 Project: Mesos
  Issue Type: Bug
Affects Versions: 1.5.0, 1.4.0
Reporter: Chun-Hung Hsiao
Assignee: Chun-Hung Hsiao


Two nested containers were launched and transitioned to FETCHING at nearly the 
same time, and tried to fetch the same artifacts. The first one failed to fetch 
some artifacts and transitioned to DESTROYING. However, the second nested 
container got stuck in FETCHING and the LAUNCH_NESTED_CONTAINER call never 
returned.

{noformat}
I0226 06:27:15.00  9494 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER 
call for container 
'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4'
...
I0226 06:27:15.00  9499 http.cpp:2581] Processing LAUNCH_NESTED_CONTAINER 
call for container 
'fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb'
...
I0226 06:27:15.00  9493 containerizer.cpp:2968] Transitioning the state of 
container 
fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from 
ISOLATING to FETCHING
I0226 06:27:15.00  9500 containerizer.cpp:2968] Transitioning the state of 
container 
fb23f12b-8ae3-4e03-9895-8df1b2865b11.3f6fead4-1857-4cbd-b226-fbc7337eb8cb from 
ISOLATING to FETCHING
...
E0226 06:29:45.00  9496 fetcher.cpp:568] Failed to run mesos-fetcher: 
Failed to fetch all URIs for container 
'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': 
exited with status 1
W0226 06:29:45.00  9497 http.cpp:2758] Failed to launch container 
fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4: 
Failed to fetch all URIs for container 
'fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4': 
exited with status 1
I0226 06:29:45.00  9497 containerizer.cpp:2354] Destroying container 
fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 in 
FETCHING state
I0226 06:29:45.00  9497 containerizer.cpp:2968] Transitioning the state of 
container 
fb23f12b-8ae3-4e03-9895-8df1b2865b11.435039fe-d6d1-4f86-abb8-61101ff64af4 from 
FETCHING to DESTROYING
{noformat}





[jira] [Created] (MESOS-8619) Docker on Windows uses USERPROFILE instead of HOME for credentials

2018-02-27 Thread Andrew Schwartzmeyer (JIRA)
Andrew Schwartzmeyer created MESOS-8619:
---

 Summary: Docker on Windows uses USERPROFILE instead of HOME for 
credentials
 Key: MESOS-8619
 URL: https://issues.apache.org/jira/browse/MESOS-8619
 Project: Mesos
  Issue Type: Improvement
 Environment: Windows 10 with Docker version 17.12.0-ce, build c97c6d6.
Reporter: Andrew Schwartzmeyer
Assignee: Andrew Schwartzmeyer


The logic for doing a {{docker pull}} of an image from a private registry 
assumes that {{.docker/config.json}} is to be found in {{$HOME}} (according 
to the [Mesosphere 
instructions|https://mesosphere.github.io/marathon/docs/native-docker-private-registry.html#docker-containerizer]
 and the 
[code|https://github.com/apache/mesos/blob/b7933c176d719766bdb6459048ede6e94f6a7763/src/docker/docker.cpp#L1710]).

However, this assumption holds only on Linux, per the [Docker 
code|https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/homedir/homedir_unix.go#L14].
 On Windows, Docker explicitly looks at the {{USERPROFILE}} environment 
variable, again [per the Docker 
code|https://github.com/moby/moby/blob/3a633a712c8bbb863fe7e57ec132dd87a9c4eff7/pkg/homedir/homedir_windows.go#L10].

So in order for Docker to pick up the config file correctly, we need to change 
the variable used on Windows in the Docker containerizer.
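
The change could look roughly like the following, sketched in Python rather 
than the containerizer's C++. The function names and the {{config_dir}} 
parameter are hypothetical; only the {{HOME}} / {{USERPROFILE}} distinction 
comes from the Docker code linked above.

```python
def docker_home_variable(system):
    """Name of the environment variable Docker consults for the home
    directory: USERPROFILE on Windows, HOME elsewhere."""
    return "USERPROFILE" if system == "Windows" else "HOME"


def docker_environment(config_dir, system):
    """Environment to pass to `docker pull` so that Docker looks for
    .docker/config.json under config_dir (a hypothetical directory)."""
    return {docker_home_variable(system): config_dir}
```

The {{system}} argument is expected to match what Python's 
{{platform.system()}} returns, i.e. {{"Windows"}} on Windows.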





[jira] [Created] (MESOS-8618) ReconciliationTest.ReconcileStatusUpdateTaskState is flaky.

2018-02-27 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-8618:
--

 Summary: ReconciliationTest.ReconcileStatusUpdateTaskState is 
flaky.
 Key: MESOS-8618
 URL: https://issues.apache.org/jira/browse/MESOS-8618
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: ec Debian 9 with SSL
Reporter: Alexander Rukletsov
 Attachments: 
ReconciliationTest.ReconcileStatusUpdateTaskState-badrun.txt

{noformat}
../../src/tests/reconciliation_tests.cpp:1129
  Expected: TASK_RUNNING
To be equal to: update->state()
  Which is: TASK_FINISHED
{noformat}
Full log attached.





[jira] [Commented] (MESOS-8617) Tests using default executor occasionally fail.

2018-02-27 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8617?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378765#comment-16378765
 ] 

Alexander Rukletsov commented on MESOS-8617:


From a brief look over the logs, the default executor either has not started 
at all or crashed before writing anything into the log.

> Tests using default executor occasionally fail.
> ---
>
> Key: MESOS-8617
> URL: https://issues.apache.org/jira/browse/MESOS-8617
> Project: Mesos
>  Issue Type: Bug
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test
> Attachments: MasterTest.TasksEndpoint-badrun.txt, 
> MasterTest.TasksEndpoint-goodrun.txt
>
>
> Task transition expectation can be violated resulting in a failing test, e.g.:
> {noformat}
> ../../src/tests/master_tests.cpp:4134: Failure
>   Expected: TASK_RUNNING
> To be equal to: status1->state()
>   Which is: TASK_LOST
> {noformat}
> List of known affected tests:
> {noformat}
> MasterTest.TasksEndpoint
> {noformat}





[jira] [Created] (MESOS-8617) Tests using default executor occasionally fail.

2018-02-27 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-8617:
--

 Summary: Tests using default executor occasionally fail.
 Key: MESOS-8617
 URL: https://issues.apache.org/jira/browse/MESOS-8617
 Project: Mesos
  Issue Type: Bug
Reporter: Alexander Rukletsov
 Attachments: MasterTest.TasksEndpoint-badrun.txt

Task transition expectation can be violated resulting in a failing test, e.g.:
{noformat}
../../src/tests/master_tests.cpp:4134: Failure
  Expected: TASK_RUNNING
To be equal to: status1->state()
  Which is: TASK_LOST
{noformat}
List of known affected tests:
{noformat}
MasterTest.TasksEndpoint
{noformat}





[jira] [Commented] (MESOS-8521) Various IOSwitchboard related tests fail on macOS High Sierra.

2018-02-27 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16378719#comment-16378719
 ] 

Alexander Rukletsov commented on MESOS-8521:


I've just successfully run a complete {{ninja check}} on macOS High Sierra 
10.13.3 (17D47) with {{Apple LLVM version 9.0.0 (clang-900.0.39.2)}}:
{noformat}
[==] Running 6 tests from 2 test cases.
[--] Global test environment set-up.
[--] 4 tests from IOSwitchboardTest
[ RUN  ] IOSwitchboardTest.ContainerAttach
[   OK ] IOSwitchboardTest.ContainerAttach (179 ms)
[ RUN  ] IOSwitchboardTest.OutputRedirectionWithTTY
[   OK ] IOSwitchboardTest.OutputRedirectionWithTTY (124 ms)
[ RUN  ] IOSwitchboardTest.KillSwitchboardContainerDestroyed
[   OK ] IOSwitchboardTest.KillSwitchboardContainerDestroyed (283 ms)
[ RUN  ] IOSwitchboardTest.ContainerAttachAfterSlaveRestart
[   OK ] IOSwitchboardTest.ContainerAttachAfterSlaveRestart (413 ms)
[--] 4 tests from IOSwitchboardTest (1006 ms total)

[--] 2 tests from ContentType/AgentAPITest
[ RUN  ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
[   OK ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0 (404 ms)
[ RUN  ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
[   OK ] ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1 (409 ms)
[--] 2 tests from ContentType/AgentAPITest (819 ms total)

[--] Global test environment tear-down
[==] 6 tests from 2 test cases ran. (1847 ms total)
[  PASSED  ] 6 tests.
{noformat}

> Various IOSwitchboard related tests fail on macOS High Sierra. 
> ---
>
> Key: MESOS-8521
> URL: https://issues.apache.org/jira/browse/MESOS-8521
> Project: Mesos
>  Issue Type: Bug
> Environment: macOS 10.13.2 (17C88)
> Apple LLVM version 9.0.0 (clang-900.0.39.2)
>Reporter: Till Toenshoff
>Priority: Major
>
> The problem appears to cause several switchboard tests to fail. Note that 
> this problem does not manifest on older Apple systems.
> The failure rate on this system is 100%.
> List of currently failing tests:
> {noformat}
> IOSwitchboardTest.ContainerAttach
> IOSwitchboardTest.ContainerAttachAfterSlaveRestart
> IOSwitchboardTest.OutputRedirectionWithTTY
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/0
> ContentType/AgentAPITest.LaunchNestedContainerSessionWithTTY/1
> {noformat}
> This is an example using {{GLOG=v1}} verbose logging:
> {noformat}
> [ RUN  ] IOSwitchboardTest.ContainerAttach
> I0201 03:02:51.925930 2385417024 containerizer.cpp:304] Using isolation { 
> environment_secret, filesystem/posix, posix/cpu }
> I0201 03:02:51.926230 2385417024 provisioner.cpp:299] Using default backend 
> 'copy'
> I0201 03:02:51.927325 107409408 containerizer.cpp:674] Recovering 
> containerizer
> I0201 03:02:51.928336 109019136 provisioner.cpp:495] Provisioner recovery 
> complete
> I0201 03:02:51.934250 105799680 containerizer.cpp:1202] Starting container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.936218 105799680 containerizer.cpp:1368] Checkpointed 
> ContainerConfig at 
> '/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad/config'
> I0201 03:02:51.936251 105799680 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PROVISIONING to 
> PREPARING
> I0201 03:02:51.937369 109019136 switchboard.cpp:429] Allocated pseudo 
> terminal '/dev/ttys003' for container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.943632 109019136 switchboard.cpp:557] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8"
>  --stderr_from_fd="7" --stderr_to_fd="2" --stdin_to_fd="7" 
> --stdout_from_fd="7" --stdout_to_fd="1" --tty="true" 
> --wait_for_connection="false"' for container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.945106 109019136 switchboard.cpp:587] Created I/O switchboard 
> server (pid: 83716) listening on socket file 
> '/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8' for 
> container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.947762 106336256 containerizer.cpp:1844] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"sleep 
>