[jira] [Comment Edited] (MESOS-8573) Container stuck in PULLING when Docker daemon hangs

2018-02-19 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367705#comment-16367705
 ] 

Gilbert Song edited comment on MESOS-8573 at 2/20/18 7:21 AM:
--

[https://reviews.apache.org/r/65689/]
[https://reviews.apache.org/r/65712/]


was (Author: gilbert):
[https://reviews.apache.org/r/65689/]
[https://reviews.apache.org/r/65712/]

> Container stuck in PULLING when Docker daemon hangs
> ---
>
> Key: MESOS-8573
> URL: https://issues.apache.org/jira/browse/MESOS-8573
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Gilbert Song
>Priority: Major
>  Labels: mesosphere
>
> When the {{force}} argument is not set to {{true}}, {{Docker::pull}} will 
> always perform a {{docker inspect}} call before it does a {{docker pull}}. If 
> either of these two Docker CLI calls hangs indefinitely, the Docker container 
> will be stuck in the PULLING state. This means that we make no further 
> progress in the {{launch()}} call path, so the executor binary is never 
> executed, the {{Future}} associated with the {{launch()}} call is never 
> failed or satisfied, and {{wait()}} is never called on the container. The 
> agent chains the executor cleanup onto that {{wait()}} call which is never 
> made. So, when the executor registration timeout elapses, 
> {{containerizer->destroy()}} is called on the executor container, but the 
> rest of the executor cleanup is never performed, and no terminal task status 
> update is sent.
> This leaves the task destined for that Docker executor stuck in TASK_STAGING 
> from the framework's perspective, and attempts to kill the task will fail.
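The hang described above could in principle be bounded by putting a timeout around the Docker CLI calls so a hung daemon fails the launch instead of wedging it. A minimal standard-C++ sketch of that pattern (not Mesos code; all names here are hypothetical):

```cpp
#include <chrono>
#include <future>
#include <string>
#include <thread>

// Hypothetical stand-in for a Docker CLI call (`docker inspect` /
// `docker pull`) that may block for a long time.
std::string slowDockerCall() {
  std::this_thread::sleep_for(std::chrono::milliseconds(50));
  return "pulled";
}

// Run `call`, but give up after `timeout` so a hung daemon cannot block
// the launch path forever. In Mesos terms, the Future associated with
// launch() would be failed rather than returning a sentinel string.
template <typename F>
std::string callWithTimeout(F call, std::chrono::milliseconds timeout) {
  std::future<std::string> result = std::async(std::launch::async, call);
  if (result.wait_for(timeout) == std::future_status::ready) {
    return result.get();
  }
  // NOTE: a std::async future blocks in its destructor until the task
  // finishes; real code would also have to kill the hung subprocess.
  return "TIMED_OUT";
}
```

This only shows the timeout decision itself; actually discarding or killing the hung `docker` subprocess is a separate problem.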



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (MESOS-8573) Container stuck in PULLING when Docker daemon hangs

2018-02-19 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16367705#comment-16367705
 ] 

Gilbert Song edited comment on MESOS-8573 at 2/20/18 7:21 AM:
--

[https://reviews.apache.org/r/65689/]
[https://reviews.apache.org/r/65712/]


was (Author: gilbert):
https://reviews.apache.org/r/65689/

> Container stuck in PULLING when Docker daemon hangs
> ---
>
> Key: MESOS-8573
> URL: https://issues.apache.org/jira/browse/MESOS-8573
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Gilbert Song
>Priority: Major
>  Labels: mesosphere
>
> When the {{force}} argument is not set to {{true}}, {{Docker::pull}} will 
> always perform a {{docker inspect}} call before it does a {{docker pull}}. If 
> either of these two Docker CLI calls hangs indefinitely, the Docker container 
> will be stuck in the PULLING state. This means that we make no further 
> progress in the {{launch()}} call path, so the executor binary is never 
> executed, the {{Future}} associated with the {{launch()}} call is never 
> failed or satisfied, and {{wait()}} is never called on the container. The 
> agent chains the executor cleanup onto that {{wait()}} call which is never 
> made. So, when the executor registration timeout elapses, 
> {{containerizer->destroy()}} is called on the executor container, but the 
> rest of the executor cleanup is never performed, and no terminal task status 
> update is sent.
> This leaves the task destined for that Docker executor stuck in TASK_STAGING 
> from the framework's perspective, and attempts to kill the task will fail.





[jira] [Commented] (MESOS-8595) Mesos agent's use of /tmp for overlayfs could be confusing

2018-02-19 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369442#comment-16369442
 ] 

Yan Xu commented on MESOS-8595:
---

/cc [~gilbert] [~zhitao]

> Mesos agent's use of /tmp for overlayfs could be confusing
> --
>
> Key: MESOS-8595
> URL: https://issues.apache.org/jira/browse/MESOS-8595
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Minor
>
> With MESOS-6000, Mesos creates temp directories under {{/tmp}}. This could 
> be surprising for operators who see no Mesos flags specified with a {{/tmp}} 
> prefix (or with such a default value) but who discover such directories on 
> the host.
> We should at least group them under {{/tmp/mesos}} to suggest that Mesos 
> created them.





[jira] [Created] (MESOS-8595) Mesos agent

2018-02-19 Thread Yan Xu (JIRA)
Yan Xu created MESOS-8595:
-

 Summary: Mesos agent
 Key: MESOS-8595
 URL: https://issues.apache.org/jira/browse/MESOS-8595
 Project: Mesos
  Issue Type: Bug
Reporter: Yan Xu


With MESOS-6000, Mesos creates temp directories under {{/tmp}}. This could be 
surprising for operators who see no Mesos flags specified with a {{/tmp}} 
prefix (or with such a default value) but who discover such directories on the 
host.

We should at least group them under {{/tmp/mesos}} to suggest that Mesos 
created them.
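The grouping suggested above can be sketched in a few lines of standard C++ (a hypothetical helper, not Mesos code): every agent-created temp directory gets nested under an attributable "<tmp>/mesos" prefix.

```cpp
#include <filesystem>
#include <string>

// Hypothetical sketch: instead of creating work directories directly
// under /tmp, nest them under "<tmp>/mesos/<name>" so operators can tell
// at a glance who created them.
std::filesystem::path mesosTempDir(
    const std::filesystem::path& tmpRoot, const std::string& name) {
  std::filesystem::path dir = tmpRoot / "mesos" / name;
  std::filesystem::create_directories(dir);  // creates parents as needed
  return dir;
}
```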





[jira] [Commented] (MESOS-8594) Mesos master crash (under load)

2018-02-19 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369305#comment-16369305
 ] 

Jie Yu commented on MESOS-8594:
---

cc [~bmahler], [~benjaminhindman]

This will likely be resolved by using `loop` in libprocess, which prevents 
unbounded growth of the stack.

> Mesos master crash (under load)
> ---
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.6.0
>Reporter: A. Dukhovniy
>Priority: Major
>  Labels: reliability
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some info from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
> frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template 
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik]
> {quote}it’s the stack overflow bug in libprocess due to the way 
> `internal::send()` and `internal::_send()` are implemented in `process.cpp`
> {quote}





[jira] [Commented] (MESOS-8594) Mesos master crash (under load)

2018-02-19 Thread Benno Evers (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369238#comment-16369238
 ] 

Benno Evers commented on MESOS-8594:


The analysis by [~abudnik] seems to be correct: the actual site of the crash 
looks completely harmless, with no dangling pointers or anything, while the 
call stack is very deep, going repeatedly through `process::internal::send()` 
and `process::internal::_send()`.

The root cause seems to be this ancient TODO in `Future::onAny()`:
{noformat}
  synchronized (data->lock) {
    if (data->state == PENDING) {
      data->onAnyCallbacks.emplace_back(std::move(callback));
    } else {
      run = true;
    }
  }

  // TODO(*): Invoke callback in another execution context.
  if (run) {
    std::move(callback)(*this); // NOLINT(misc-use-after-move)
  }
{noformat}

So whenever we arrive in `send()` and the future returned by the socket is 
already finished, we add another 5-10 frames to the stack.

Most likely, due to the large number of big packets being sent over the 
loopback interface, there is always enough data to build up a call stack deep 
enough to make the program run out of stack space.
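The "another execution context" the TODO asks for can be sketched in plain C++ (not libprocess code; the class and names are hypothetical): completed callbacks go onto a queue and are drained by a flat loop, so back-to-back completions no longer deepen the stack.

```cpp
#include <deque>
#include <functional>

// Minimal sketch: instead of invoking a completed future's callback
// synchronously -- which nests send() -> _send() -> send() and grows the
// stack -- completed callbacks are queued and run from a flat loop.
class CallbackQueue {
public:
  // Schedule a callback; it runs from the drain loop, not the caller's stack.
  void defer(std::function<void()> callback) {
    queue.push_back(std::move(callback));
  }

  // Trampoline: runs callbacks one at a time at a fixed stack depth, even
  // when a callback schedules further callbacks. Returns how many ran.
  int drain() {
    int ran = 0;
    while (!queue.empty()) {
      std::function<void()> callback = std::move(queue.front());
      queue.pop_front();
      callback();
      ++ran;
    }
    return ran;
  }

private:
  std::deque<std::function<void()>> queue;
};
```

A chain of 10,000 mutually-scheduling callbacks runs at constant stack depth here, whereas the synchronous version would recurse 10,000 frames deep.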

 

> Mesos master crash (under load)
> ---
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.6.0
>Reporter: A. Dukhovniy
>Priority: Major
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some info from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
> frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template 
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik] 
> {quote}
> it’s the stack overflow bug in libprocess due to the way `internal::send()` 
> and `internal::_send()` are implemented in `process.cpp`
> {quote}





[jira] [Assigned] (MESOS-8553) Implement a test to reproduce a bug in launch nested container call.

2018-02-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8553:
--

Assignee: Benno Evers

> Implement a test to reproduce a bug in launch nested container call.
> 
>
> Key: MESOS-8553
> URL: https://issues.apache.org/jira/browse/MESOS-8553
> Project: Mesos
>  Issue Type: Task
>  Components: test
>Reporter: Andrei Budnik
>Assignee: Benno Evers
>Priority: Major
>  Labels: flaky-test, mesosphere
>
> It's known that in some circumstances an attempt to launch a nested container 
> session might fail with the following error message:
> {code:java}
> Failed to enter mount namespace: Failed to open '/proc/29473/ns/mnt': No such 
> file or directory
> {code}
> That message is written by [linux 
> launcher|https://github.com/apache/mesos/blob/f7dbd29bd9809d1dd254041537ca875e7ea26613/src/slave/containerizer/mesos/launch.cpp#L742-L743]
>  to stdout. This bug is most likely caused by 
> [getMountNamespaceTarget()|https://github.com/apache/mesos/blob/f7dbd29bd9809d1dd254041537ca875e7ea26613/src/slave/containerizer/mesos/utils.cpp#L59].
> Steps for the test could be:
>  1) Start a long running task in its own container (e.g. `sleep 1000`)
>  2) Start a new short-living nested container via `LAUNCH_NESTED_CONTAINER` 
> (e.g. `echo echo`)
>  3) Call `WAIT_NESTED_CONTAINER` on that nested container
>  4) Start long-living nested container via `LAUNCH_NESTED_CONTAINER` (e.g. 
> `cat`)
>  5) Kill that nested container via `KILL_NESTED_CONTAINER`
>  6) Start another long-living nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION`  (e.g. `cat`)
>  7) Attach to that container via `ATTACH_CONTAINER_INPUT` and write non-empty 
> message M to container's stdin
>  8) Check the output of the nested container: it should contain message M
> The bug might pop up during step 8.





[jira] [Assigned] (MESOS-8521) IOSwitchboardTest::ContainerAttach fails on macOS.

2018-02-19 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8521?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-8521:
--

Assignee: (was: Andrei Budnik)

> IOSwitchboardTest::ContainerAttach fails on macOS. 
> ---
>
> Key: MESOS-8521
> URL: https://issues.apache.org/jira/browse/MESOS-8521
> Project: Mesos
>  Issue Type: Bug
> Environment: macOS 10.13.2 (17C88)
> Apple LLVM version 9.0.0 (clang-900.0.39.2)
>Reporter: Till Toenshoff
>Priority: Major
>
> The problem appears to cause several switchboard tests to fail. Note that 
> this problem does not manifest on older Apple systems.
> The failure rate on this system is 100%.
> This is an example using {{GLOG=v1}} verbose logging:
> {noformat}
> [ RUN  ] IOSwitchboardTest.ContainerAttach
> I0201 03:02:51.925930 2385417024 containerizer.cpp:304] Using isolation { 
> environment_secret, filesystem/posix, posix/cpu }
> I0201 03:02:51.926230 2385417024 provisioner.cpp:299] Using default backend 
> 'copy'
> I0201 03:02:51.927325 107409408 containerizer.cpp:674] Recovering 
> containerizer
> I0201 03:02:51.928336 109019136 provisioner.cpp:495] Provisioner recovery 
> complete
> I0201 03:02:51.934250 105799680 containerizer.cpp:1202] Starting container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.936218 105799680 containerizer.cpp:1368] Checkpointed 
> ContainerConfig at 
> '/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad/config'
> I0201 03:02:51.936251 105799680 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PROVISIONING to 
> PREPARING
> I0201 03:02:51.937369 109019136 switchboard.cpp:429] Allocated pseudo 
> terminal '/dev/ttys003' for container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.943632 109019136 switchboard.cpp:557] Launching 
> 'mesos-io-switchboard' with flags '--heartbeat_interval="30secs" 
> --help="false" 
> --socket_address="/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8"
>  --stderr_from_fd="7" --stderr_to_fd="2" --stdin_to_fd="7" 
> --stdout_from_fd="7" --stdout_to_fd="1" --tty="true" 
> --wait_for_connection="false"' for container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.945106 109019136 switchboard.cpp:587] Created I/O switchboard 
> server (pid: 83716) listening on socket file 
> '/tmp/mesos-io-switchboard-d3bcec3f-7c29-4630-b374-55fabb6034d8' for 
> container 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:51.947762 106336256 containerizer.cpp:1844] Launching 
> 'mesos-containerizer' with flags '--help="false" 
> --launch_info="{"command":{"shell":true,"value":"sleep 
> 1000"},"environment":{"variables":[{"name":"MESOS_SANDBOX","type":"VALUE","value":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}]},"task_environment":{},"tty_slave_path":"\/dev\/ttys003","working_directory":"\/var\/folders\/_t\/rdp354gx7j5fjww270kbk6_rgn\/T\/IOSwitchboardTest_ContainerAttach_W9gDw0"}"
>  --pipe_read="7" --pipe_write="10" 
> --runtime_directory="/var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_1nkPYl/containers/1b1af888-9e39-4c13-a647-ac43c0df9fad"'
> I0201 03:02:51.949144 106336256 launcher.cpp:140] Forked child with pid 
> '83717' for container '1b1af888-9e39-4c13-a647-ac43c0df9fad'
> I0201 03:02:51.949896 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from PREPARING to 
> ISOLATING
> I0201 03:02:51.951071 106336256 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from ISOLATING to 
> FETCHING
> I0201 03:02:51.951190 108482560 fetcher.cpp:369] Starting to fetch URIs for 
> container: 1b1af888-9e39-4c13-a647-ac43c0df9fad, directory: 
> /var/folders/_t/rdp354gx7j5fjww270kbk6_rgn/T/IOSwitchboardTest_ContainerAttach_W9gDw0
> I0201 03:02:51.951791 109019136 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from FETCHING to 
> RUNNING
> I0201 03:02:52.076602 106872832 containerizer.cpp:2338] Destroying container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad in RUNNING state
> I0201 03:02:52.076644 106872832 containerizer.cpp:2952] Transitioning the 
> state of container 1b1af888-9e39-4c13-a647-ac43c0df9fad from RUNNING to 
> DESTROYING
> I0201 03:02:52.076920 106872832 launcher.cpp:156] Asked to destroy container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad
> I0201 03:02:52.158571 107945984 containerizer.cpp:2791] Container 
> 1b1af888-9e39-4c13-a647-ac43c0df9fad has exited
> I0201 03:02:57.162788 110092288 switchboard.cpp:790] Sending SIGTERM to I/O 
> switchboard server (pid: 

[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-19 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369148#comment-16369148
 ] 

Alexander Rukletsov commented on MESOS-8534:


[~sagar8192] Do I understand correctly that, after your change, if a 
container requests a separate network namespace, TCP and HTTP health checks 
will not work? I am concerned that this could be surprising for users.

> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
>
> As per the discussion with [~jieyu] and [~avinash.mesos] , I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with parent/root container.





[jira] [Commented] (MESOS-8594) Mesos master crash (under load)

2018-02-19 Thread A. Dukhovniy (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16369138#comment-16369138
 ] 

A. Dukhovniy commented on MESOS-8594:
-

I could also attach a core dump, but it's 1.6 GB.

> Mesos master crash (under load)
> ---
>
> Key: MESOS-8594
> URL: https://issues.apache.org/jira/browse/MESOS-8594
> Project: Mesos
>  Issue Type: Bug
>  Components: master
>Affects Versions: 1.5.0, 1.6.0
>Reporter: A. Dukhovniy
>Priority: Major
> Attachments: lldb-bt.txt, lldb-di-f.txt, lldb-image-section.txt, 
> lldb-regiser-read.txt
>
>
> Mesos master crashes under load. Attached is some info from `lldb`:
> {code:java}
> Process 41933 resuming
> Process 41933 stopped
> * thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
> frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
> 32 template 
> 33 struct _Some
> 34 {
> -> 35 _Some(T _t) : t(std::move(_t)) {}
> 36
> 37 T t;
> 38 };
> Target 0: (mesos-master) stopped.
> (lldb)
> {code}
> To quote [~abudnik] 
> {quote}
> it’s the stack overflow bug in libprocess due to the way `internal::send()` 
> and `internal::_send()` are implemented in `process.cpp`
> {quote}





[jira] [Created] (MESOS-8594) Mesos master crash (under load)

2018-02-19 Thread A. Dukhovniy (JIRA)
A. Dukhovniy created MESOS-8594:
---

 Summary: Mesos master crash (under load)
 Key: MESOS-8594
 URL: https://issues.apache.org/jira/browse/MESOS-8594
 Project: Mesos
  Issue Type: Bug
  Components: master
Affects Versions: 1.5.0, 1.6.0
Reporter: A. Dukhovniy


Mesos master crashes under load. Attached is some info from `lldb`:
{code:java}
Process 41933 resuming
Process 41933 stopped
* thread #10, stop reason = EXC_BAD_ACCESS (code=2, address=0x789ecff8)
frame #0: 0x00010c30ddb6 libmesos-1.6.0.dylib`::_Some() at some.hpp:35
32 template 
33 struct _Some
34 {
-> 35 _Some(T _t) : t(std::move(_t)) {}
36
37 T t;
38 };
Target 0: (mesos-master) stopped.
(lldb)
{code}

To quote [~abudnik] 
{quote}
it’s the stack overflow bug in libprocess due to the way `internal::send()` 
and `internal::_send()` are implemented in `process.cpp`
{quote}







[jira] [Created] (MESOS-8593) Support credential updates in Docker config without restarting the agent

2018-02-19 Thread Jan Schlicht (JIRA)
Jan Schlicht created MESOS-8593:
---

 Summary: Support credential updates in Docker config without 
restarting the agent
 Key: MESOS-8593
 URL: https://issues.apache.org/jira/browse/MESOS-8593
 Project: Mesos
  Issue Type: Improvement
  Components: containerization, docker
Reporter: Jan Schlicht


When using the Mesos containerizer with a private Docker repository and the 
{{--docker_config}} option, the repository might expire credentials after some 
time, forcing the user to log in again. In that case the Docker config in use 
changes, and currently the agent needs to be restarted to pick up the change. 
Instead of requiring a restart, the agent could reload the Docker config file 
every time before fetching.
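One possible shape for this, sketched in standard C++ (a hypothetical class, not Mesos code): cache the config contents keyed on the file's modification time, and re-read it before each fetch only when it has changed.

```cpp
#include <filesystem>
#include <fstream>
#include <sstream>
#include <string>

// Hypothetical sketch of the proposed behavior: rather than reading the
// Docker config once at agent startup, re-check the file's mtime before
// every fetch and reload it when it has changed, so refreshed credentials
// are picked up without restarting the agent.
class DockerConfigCache {
public:
  explicit DockerConfigCache(std::filesystem::path path)
    : path(std::move(path)) {}

  // Returns the current config, reloading from disk if the file changed.
  const std::string& get() {
    auto mtime = std::filesystem::last_write_time(path);
    if (!loaded || mtime != lastMtime) {
      std::ifstream file(path);
      std::ostringstream contents;
      contents << file.rdbuf();
      cached = contents.str();
      lastMtime = mtime;
      loaded = true;
    }
    return cached;
  }

private:
  std::filesystem::path path;
  std::filesystem::file_time_type lastMtime{};
  std::string cached;
  bool loaded = false;
};
```

An mtime check like this is cheap enough to run on every fetch; a production version would also need to handle a missing or unreadable file gracefully.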


