[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366502#comment-16366502
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user jieyu commented on a diff in the pull request:

https://github.com/apache/mesos/pull/263#discussion_r168600405
  
--- Diff: src/slave/containerizer/mesos/isolators/network/cni/cni.cpp ---
@@ -570,10 +570,17 @@ Future

[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366505#comment-16366505
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user jieyu commented on a diff in the pull request:

https://github.com/apache/mesos/pull/263#discussion_r168652532
  
--- Diff: src/slave/containerizer/mesos/isolators/network/cni/cni.cpp ---
@@ -751,10 +751,11 @@ Future

[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366504#comment-16366504
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user jieyu commented on a diff in the pull request:

https://github.com/apache/mesos/pull/263#discussion_r168655002
  
--- Diff: src/slave/containerizer/mesos/isolators/network/cni/cni.cpp ---
@@ -751,10 +751,11 @@ Future

[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366508#comment-16366508
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user jieyu commented on a diff in the pull request:

https://github.com/apache/mesos/pull/263#discussion_r168654738
  
--- Diff: src/slave/containerizer/mesos/isolators/network/cni/cni.cpp ---
@@ -820,18 +821,16 @@ Future NetworkCniIsolatorProcess::isolate(
   CHECK_SOME(rootDir);
   CHECK_SOME(pluginDir);
 
-  if (containerId.has_parent()) {
+  if (!infos[containerId]->needsSeparateNs) {
--- End diff --

I'd make this more explicit.
```
// NOTE: DEBUG container should not have Info struct. Thus if the control
// reaches here, the container is not a DEBUG container.
if (isNestedContainer && joinParentNetwork)
```


> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
>
> As per the discussion with [~jieyu] and [~avinash.mesos], I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366506#comment-16366506
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user jieyu commented on a diff in the pull request:

https://github.com/apache/mesos/pull/263#discussion_r168647358
  
--- Diff: src/slave/containerizer/mesos/isolators/network/cni/cni.cpp ---
@@ -570,10 +570,17 @@ Future

[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366507#comment-16366507
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user jieyu commented on a diff in the pull request:

https://github.com/apache/mesos/pull/263#discussion_r168651591
  
--- Diff: src/slave/containerizer/mesos/isolators/network/cni/cni.cpp ---
@@ -570,10 +570,17 @@ Future

[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366503#comment-16366503
 ] 

ASF GitHub Bot commented on MESOS-8534:
---

Github user jieyu commented on a diff in the pull request:

https://github.com/apache/mesos/pull/263#discussion_r168652041
  
--- Diff: src/slave/containerizer/mesos/isolators/network/cni/cni.cpp ---
@@ -721,7 +721,7 @@ Future

[jira] [Comment Edited] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread Sagar Sadashiv Patwardhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16362737#comment-16362737
 ] 

Sagar Sadashiv Patwardhan edited comment on MESOS-8534 at 2/16/18 1:16 AM:
---

[~alexr] Yes, this will affect both HTTP and TCP healthchecks. Let me figure 
out what can be done to retain the existing functionality.


was (Author: sagar8192):
[~alexr] Yes, I think this will affect both HTTP and TCP healthchecks. Let me 
figure what can be done to retain the existing functionality.

> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
>
> As per the discussion with [~jieyu] and [~avinash.mesos], I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can share 
> namespaces with the parent/root container.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8534) Allow nested containers in TaskGroups to have separate network namespaces

2018-02-15 Thread Sagar Sadashiv Patwardhan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366490#comment-16366490
 ] 

Sagar Sadashiv Patwardhan commented on MESOS-8534:
--

I discussed this with [~jieyu] today. Making TCP and HTTP healthchecks work is 
not straightforward and will require a lot of work. He suggested that we use a 
command check instead. Command checks for nested containers already execute 
commands under the target nested container's namespaces. So, we can use 
`curl 127.0.0.1:` instead of an HTTP healthcheck. This solution works for 
our use case.
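
For reference, a minimal sketch of such a command check built with the Mesos v1 
protobuf API (the port 8080 and the /health endpoint are placeholders, not part 
of this issue):

{code}
#include <mesos/v1/mesos.hpp>

// Sketch: a COMMAND health check that curls localhost from inside the nested
// container's own namespaces, replacing an HTTP health check. The port and
// endpoint below are placeholders.
mesos::v1::HealthCheck makeCommandHealthCheck()
{
  mesos::v1::HealthCheck healthCheck;
  healthCheck.set_type(mesos::v1::HealthCheck::COMMAND);
  healthCheck.mutable_command()->set_value(
      "curl -f http://127.0.0.1:8080/health");
  return healthCheck;
}
{code}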

> Allow nested containers in TaskGroups to have separate network namespaces
> -
>
> Key: MESOS-8534
> URL: https://issues.apache.org/jira/browse/MESOS-8534
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Sagar Sadashiv Patwardhan
>Priority: Minor
>  Labels: cni
>
> As per the discussion with [~jieyu] and [~avinash.mesos], I am going to 
> allow nested containers in TaskGroups to have separate namespaces. I am also 
> going to retain the existing functionality, where nested containers can 
> connect to the parent/root container's namespace.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7499) Allow "insecure registry" in Mesos containerizer through some operator configuration

2018-02-15 Thread Chun-Hung Hsiao (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7499?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366348#comment-16366348
 ] 

Chun-Hung Hsiao commented on MESOS-7499:


[~gilbert] Is there any plan to support a non-SSL registry on any port?

> Allow "insecure registry" in Mesos containerizer through some operator 
> configuration
> 
>
> Key: MESOS-7499
> URL: https://issues.apache.org/jira/browse/MESOS-7499
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Zhitao Li
>Priority: Major
>
> Similar to {{--insecure-registry}} for the Docker daemon, the Mesos containerizer 
> should allow a cluster operator to relax the HTTPS requirement on certain Docker 
> registries.
> A practical use case is internal registry addresses hosted on a private network 
> in a corporation: we often trust these addresses and do not want to configure 
> extra certs for them. 



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8588) Introduce a work stealing Process scheduler.

2018-02-15 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8588:
--

 Summary: Introduce a work stealing Process scheduler.
 Key: MESOS-8588
 URL: https://issues.apache.org/jira/browse/MESOS-8588
 Project: Mesos
  Issue Type: Epic
  Components: libprocess
Reporter: Benjamin Mahler


Currently, libprocess uses a work sharing Process scheduler, in which all 
workers take work from a global shared queue of runnable Processes. This has some 
performance implications: for example, the shared global queue can have a high 
degree of contention, and Processes can migrate across cores frequently (which 
can be even more expensive on NUMA systems).

We can introduce an alternative work stealing scheduler, in which each worker 
has its own queue. When a worker runs out of items, it steals from another 
worker's queue (stealing attempts could be done with cache locality / NUMA in 
mind).

Ideally, the Process scheduler would be a run-time option, provided that does not 
introduce performance overhead compared to a compile-time option.
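
For illustration only, a minimal sketch of the per-worker queue idea (not 
libprocess code; a real implementation would likely use lock-free deques such 
as Chase-Lev rather than a mutex per queue):

{code}
#include <deque>
#include <functional>
#include <mutex>
#include <optional>
#include <vector>

// Each worker owns a queue of runnable items; it only touches other workers'
// queues (under their locks) when its own queue is empty.
struct Worker
{
  std::deque<std::function<void()>> queue;
  std::mutex mutex;
};

// Owner path: pop from the front of our own queue.
std::optional<std::function<void()>> popLocal(Worker& self)
{
  std::lock_guard<std::mutex> lock(self.mutex);
  if (self.queue.empty()) {
    return std::nullopt;
  }
  std::function<void()> item = std::move(self.queue.front());
  self.queue.pop_front();
  return item;
}

// Steal path: take from the back of some other worker's queue.
std::optional<std::function<void()>> steal(std::vector<Worker>& workers, size_t self)
{
  for (size_t i = 0; i < workers.size(); ++i) {
    if (i == self) {
      continue;
    }
    std::lock_guard<std::mutex> lock(workers[i].mutex);
    if (!workers[i].queue.empty()) {
      std::function<void()> item = std::move(workers[i].queue.back());
      workers[i].queue.pop_back();
      return item;
    }
  }
  return std::nullopt;  // Nothing to steal; the worker would park/block here.
}
{code}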



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8587) Introduce a parallel for each loop (and other parallel algorithms).

2018-02-15 Thread Benjamin Mahler (JIRA)
Benjamin Mahler created MESOS-8587:
--

 Summary: Introduce a parallel for each loop (and other parallel 
algorithms).
 Key: MESOS-8587
 URL: https://issues.apache.org/jira/browse/MESOS-8587
 Project: Mesos
  Issue Type: Improvement
  Components: libprocess
Reporter: Benjamin Mahler


Consider the following code:

{code}
SomeProcess::func()
{
  foreach (const Item& item, items) {
// Perform some const work on item.
  }
}
{code}

When {{items}} becomes very large, this code would benefit from some 
parallelism. With a parallel loop construct, we could improve the performance 
of this type of code significantly:

{code}
SomeProcess::func()
{
  foreach_parallel (items, [=](const Item& item) {
// Perform some const work on item.
  });
}
{code}

Ideally, this could enforce const-access to the current Process for safety. An 
implementation of this would need to do something like:

# Split the iteration of {{items}} into 1 <= N <= num_worker_threads segments.
# Spawn N-1 additional temporary execution Processes (or re-use them from a pool).
# Dispatch to these N-1 additional Processes for them to perform their segment 
of the iteration.
# Perform the 1st segment on the current Process.
# Have the current Process block to wait for the others to finish. (Note: we need 
to avoid deadlocking the worker threads here!)

This generalizes to many other algorithms beyond just iteration. It may be 
good to align this with the C++ Parallelism TS, which shows how many of the C++ 
algorithms have potential for parallel counterparts.
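
Purely as an illustration of the segmentation steps above, using plain 
std::thread instead of the temporary execution Processes this ticket actually 
proposes (the name {{foreach_parallel}} is hypothetical):

{code}
#include <algorithm>
#include <thread>
#include <vector>

// Illustrative sketch only: split `items` into at most `numWorkers` segments,
// run `f` over segments 2..N on helper threads, run the 1st segment on the
// calling thread, then block until all helpers finish.
template <typename Item, typename F>
void foreach_parallel(const std::vector<Item>& items, F&& f, size_t numWorkers = 4)
{
  if (items.empty()) {
    return;
  }

  const size_t n = std::min(numWorkers, items.size());
  const size_t chunk = (items.size() + n - 1) / n;

  std::vector<std::thread> threads;
  for (size_t w = 1; w < n; ++w) {
    threads.emplace_back([&, w]() {
      const size_t begin = w * chunk;
      const size_t end = std::min(begin + chunk, items.size());
      for (size_t i = begin; i < end; ++i) {
        f(items[i]);  // Const work on each item of this segment.
      }
    });
  }

  // Perform the 1st segment on the calling thread.
  for (size_t i = 0; i < std::min(chunk, items.size()); ++i) {
    f(items[i]);
  }

  // Block until the others finish.
  for (std::thread& thread : threads) {
    thread.join();
  }
}
{code}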



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8586) apply-reviews.py silently does nothing when a review was submitted already.

2018-02-15 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-8586:
-

 Summary: apply-reviews.py silently does nothing when a review was 
submitted already.
 Key: MESOS-8586
 URL: https://issues.apache.org/jira/browse/MESOS-8586
 Project: Mesos
  Issue Type: Bug
Reporter: Till Toenshoff


When using {{apply-reviews.py}} on a review that had been submitted already, we 
don't get any feedback.

This seems not ideal to me, as it:
1. should at least tell me that it ignores my request due to 
https://github.com/apache/mesos/blob/b5fbfe8c5064e1ff3d81279679e75a84b1abfcef/support/apply-reviews.py#L113
2. prevents me from using it for backporting - this may be desired, and I can 
work around it by cherry-picking




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-02-15 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8568:
-

Shepherd: Alexander Rukletsov
Assignee: Benno Evers  (was: Alexander Rukletsov)
Story Points: 5

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Benno Evers
>Priority: Major
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the subsequent attempt to remove the container without calling 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.
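
To make the intended ordering concrete, here is a rough sketch of the two v1 
agent API calls involved (illustrative only; {{containerId}} stands for the 
check container's ID):

{code}
#include <mesos/v1/agent/agent.hpp>
#include <mesos/v1/mesos.hpp>

// Sketch: the checker library should send WAIT_NESTED_CONTAINER for the check
// container and only then REMOVE_NESTED_CONTAINER for the same container.
mesos::v1::agent::Call makeWaitCall(const mesos::v1::ContainerID& containerId)
{
  mesos::v1::agent::Call call;
  call.set_type(mesos::v1::agent::Call::WAIT_NESTED_CONTAINER);
  call.mutable_wait_nested_container()
    ->mutable_container_id()->CopyFrom(containerId);
  return call;
}

mesos::v1::agent::Call makeRemoveCall(const mesos::v1::ContainerID& containerId)
{
  mesos::v1::agent::Call call;
  call.set_type(mesos::v1::agent::Call::REMOVE_NESTED_CONTAINER);
  call.mutable_remove_nested_container()
    ->mutable_container_id()->CopyFrom(containerId);
  return call;
}
{code}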



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-02-15 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8568:
-

Assignee: Alexander Rukletsov

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Alexander Rukletsov
>Priority: Major
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the subsequent attempt to remove the container without calling 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-02-15 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8568:
-

Assignee: (was: Benno Evers)

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Priority: Major
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the subsequent attempt to remove the container without calling 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8568) Command checks should always call `WAIT_NESTED_CONTAINER` before `REMOVE_NESTED_CONTAINER`

2018-02-15 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8568:
-

Assignee: Benno Evers

> Command checks should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`
> --
>
> Key: MESOS-8568
> URL: https://issues.apache.org/jira/browse/MESOS-8568
> Project: Mesos
>  Issue Type: Task
>Reporter: Andrei Budnik
>Assignee: Benno Evers
>Priority: Major
>  Labels: default-executor, health-check, mesosphere
>
> After successful launch of a nested container via 
> `LAUNCH_NESTED_CONTAINER_SESSION` in a checker library, it calls 
> [waitNestedContainer 
> |https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L657]
>  for the container. Checker library 
> [calls|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L466-L487]
>  `REMOVE_NESTED_CONTAINER` to remove a previous nested container before 
> launching a nested container for a subsequent check. Hence, 
> `REMOVE_NESTED_CONTAINER` call follows `WAIT_NESTED_CONTAINER` to ensure that 
> the nested container has been terminated and can be removed/cleaned up.
> In case of failure, the library [doesn't 
> call|https://github.com/apache/mesos/blob/0a40243c6a35dc9dc41774d43ee3c19cdf9e54be/src/checks/checker_process.cpp#L627-L636]
>  `WAIT_NESTED_CONTAINER`. Despite the failure, the container might have been 
> launched, and the subsequent attempt to remove the container without calling 
> `WAIT_NESTED_CONTAINER` leads to errors like:
> {code:java}
> W0202 20:03:08.895830 7 checker_process.cpp:503] Received '500 Internal 
> Server Error' (Nested container has not terminated yet) while removing the 
> nested container 
> '2b0c542c-1f5f-42f7-b914-2c1cadb4aeca.da0a7cca-516c-4ec9-b215-b34412b670fa.check-49adc5f1-37a3-4f26-8708-e27d2d6cd125'
>  used for the COMMAND check for task 
> 'node-0-server__e26a82b0-fbab-46a0-a1ea-e7ac6cfa4c91
> {code}
> The checker library should always call `WAIT_NESTED_CONTAINER` before 
> `REMOVE_NESTED_CONTAINER`.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8512) Fetcher doesn't log its stdout/stderr properly to the log file

2018-02-15 Thread Andrew Schwartzmeyer (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8512?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16366000#comment-16366000
 ] 

Andrew Schwartzmeyer commented on MESOS-8512:
-

Tests here: https://reviews.apache.org/r/65624/

Waiting on reviews.

> Fetcher doesn't log its stdout/stderr properly to the log file
> ---
>
> Key: MESOS-8512
> URL: https://issues.apache.org/jira/browse/MESOS-8512
> Project: Mesos
>  Issue Type: Bug
> Environment: Windows 10
>Reporter: Jeff Coffler
>Assignee: Andrew Schwartzmeyer
>Priority: Major
>  Labels: fetcher, libprocess, stout, windows
>
> The fetcher doesn't log its stdout or stderr to the task's output files as 
> it does on Linux. This makes it extraordinarily difficult to diagnose fetcher 
> failures (bad URI, or permissions problems, or whatever).
> It does not appear to be a glog issue. I added output to the fetcher via cout 
> and cerr, and that output didn't show up in the log files either. So it 
> appears to be a logging capture issue.
> Note that the container launcher, launched from 
> src/slave/containerizer/mesos/launcher.cpp, does appear to log properly. 
> However, when launching the fetcher itself from 
> src/slave/containerizer/fetcher.cpp (FetcherProcess::run), logging does not 
> happen properly.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8576) Improve discard handling of 'Docker::inspect()'

2018-02-15 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8576:
-

Assignee: Greg Mann

> Improve discard handling of 'Docker::inspect()'
> ---
>
> Key: MESOS-8576
> URL: https://issues.apache.org/jira/browse/MESOS-8576
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization, docker
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: mesosphere
>
> In the call path of {{Docker::inspect()}}, each continuation currently checks 
> if {{promise->future().hasDiscard()}}, where the {{promise}} is associated 
> with the output of the {{docker inspect}} call. However, if the call to 
> {{docker inspect}} hangs indefinitely, then the continuations are never 
> invoked, and a subsequent discard of the returned {{Future}} will have no 
> effect. We should add proper {{onDiscard}} handling to that {{Future}} so 
> that appropriate cleanup is performed in such cases.
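
A rough sketch of the kind of {{onDiscard}} handling meant here (illustrative; 
{{killSubprocess}} is a hypothetical cleanup hook, not an existing Mesos 
function):

{code}
#include <functional>
#include <memory>
#include <string>

#include <process/future.hpp>

using process::Future;
using process::Promise;

// Sketch: register cleanup on the Future itself, so a discard from the caller
// triggers it even when `docker inspect` hangs and no continuation (which
// would otherwise check hasDiscard()) ever runs.
Future<std::string> inspectWithDiscardHandling(
    const std::shared_ptr<Promise<std::string>>& promise,
    const std::function<void()>& killSubprocess)
{
  Future<std::string> future = promise->future();
  future.onDiscard(killSubprocess);
  return future;
}
{code}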



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8575) Add discard handling to 'Docker::stop()'

2018-02-15 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8575?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8575:
-

Assignee: Greg Mann

> Add discard handling to 'Docker::stop()'
> 
>
> Key: MESOS-8575
> URL: https://issues.apache.org/jira/browse/MESOS-8575
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Greg Mann
>Priority: Major
>  Labels: mesosphere
>
> The 'Docker::stop()' method should be updated so that when the {{Future}} it 
> returns is discarded, a subprocess associated with a pending call to {{docker 
> stop}} will be cleaned up.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8574) Docker executor makes no progress when 'docker inspect' hangs

2018-02-15 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8574?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-8574:
-

Assignee: Andrei Budnik

> Docker executor makes no progress when 'docker inspect' hangs
> -
>
> Key: MESOS-8574
> URL: https://issues.apache.org/jira/browse/MESOS-8574
> Project: Mesos
>  Issue Type: Improvement
>  Components: docker, executor
>Affects Versions: 1.5.0
>Reporter: Greg Mann
>Assignee: Andrei Budnik
>Priority: Major
>  Labels: mesosphere
>
> In the Docker executor, many calls later in the executor's lifecycle are 
> gated on an initial {{docker inspect}} call returning: 
> https://github.com/apache/mesos/blob/bc6b61bca37752689cffa40a14c53ad89f24e8fc/src/docker/executor.cpp#L223
> If that first call to {{docker inspect}} never returns, the executor becomes 
> stuck in a state where it makes no progress and cannot be killed.
> It's tempting for the executor to simply commit suicide after a timeout, but 
> we must be careful of the case in which the executor's Docker container is 
> actually running successfully, but the Docker daemon is unresponsive. In such 
> a case, we do not want to send TASK_FAILED or TASK_KILLED if the task's 
> container is running successfully.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8569) Allow newline characters when decoding base64 strings in stout

2018-02-15 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8569?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya reassigned MESOS-8569:
-

Assignee: Kapil Arya

> Allow newline characters when decoding base64 strings in stout
> --
>
> Key: MESOS-8569
> URL: https://issues.apache.org/jira/browse/MESOS-8569
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>Priority: Major
>
> The current implementation of `stout::base64::decode` errors out on encountering 
> a newline character ("\n" or "\r\n"), which is correct wrt 
> [RFC4648#section-3.3|https://tools.ietf.org/html/rfc4648#section-3.3]. 
> However, most implementations insert a newline to delimit the encoded string and 
> ignore (instead of erroring out on) the newline character while decoding the 
> string. Since stout facilities are used by third-party modules to 
> encode/decode base64 data, it is desirable to allow decoding of 
> newline-delimited data.
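
A minimal sketch of the intended behavior, using plain std::string manipulation 
rather than the actual stout internals:

{code}
#include <algorithm>
#include <string>

// Sketch: strip CR/LF before strict decoding so that newline-delimited base64
// input (as produced by most encoders) is accepted instead of rejected.
std::string stripNewlines(const std::string& input)
{
  std::string stripped = input;
  stripped.erase(
      std::remove_if(
          stripped.begin(),
          stripped.end(),
          [](char c) { return c == '\n' || c == '\r'; }),
      stripped.end());
  return stripped;
}
{code}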



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8585) Agent Crashes When Asked to Start Task with Unknown User

2018-02-15 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann reassigned MESOS-8585:


Assignee: James Peach

> Agent Crashes When Asked to Start Task with Unknown User
> --
>
> Key: MESOS-8585
> URL: https://issues.apache.org/jira/browse/MESOS-8585
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Karsten
>Assignee: James Peach
>Priority: Blocker
> Attachments: dcos-mesos-slave.service.1.gz, 
> dcos-mesos-slave.service.2.gz
>
>
> The Marathon team has an integration test that tries to start a task with an 
> unknown user. The test expects a \{{TASK_FAILED}}. However, we see 
> \{{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent 
> crashes and restarts.
>  
> {code}
>  783 2018-02-14 14:55:45: I0214 14:55:45.319974  6213 slave.cpp:2542] 
> Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for 
> framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001
> 784 2018-02-14 14:55:45: I0214 14:55:45.320605  6213 paths.cpp:727] 
> Creating sandbox 
> '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05
> 784 
> a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
>  for user 'bad'
> 785 2018-02-14 14:55:45: F0214 14:55:45.321131  6213 paths.cpp:735] 
> CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' 
> Failed to create executor directory '/var/lib/mesos/slave/
> 785 
> slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6
> 785 66d4acc88'
> 786 2018-02-14 14:55:45: *** Check failure stack trace: ***
> 787 2018-02-14 14:55:45: @ 0x7f72033444ad  
> google::LogMessage::Fail()
> 788 2018-02-14 14:55:45: @ 0x7f72033462dd  
> google::LogMessage::SendToLog()
> 789 2018-02-14 14:55:45: @ 0x7f720334409c  
> google::LogMessage::Flush()
> 790 2018-02-14 14:55:45: @ 0x7f7203346bd9  
> google::LogMessageFatal::~LogMessageFatal()
> 791 2018-02-14 14:55:45: @ 0x56544ca378f9  
> _CheckFatal::~_CheckFatal()
> 792 2018-02-14 14:55:45: @ 0x7f720270f30d  
> mesos::internal::slave::paths::createExecutorDirectory()
> 793 2018-02-14 14:55:45: @ 0x7f720273812c  
> mesos::internal::slave::Framework::addExecutor()
> 794 2018-02-14 14:55:45: @ 0x7f7202753e35  
> mesos::internal::slave::Slave::__run()
> 795 2018-02-14 14:55:45: @ 0x7f7202764292  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4
> 795 
> listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1
> 795 
> 7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_
> 796 2018-02-14 14:55:45: @ 0x7f72032a2b11  
> process::ProcessBase::consume()
> 797 2018-02-14 14:55:45: @ 0x7f72032b183c  
> process::ProcessManager::resume()
> 798 2018-02-14 14:55:45: @ 0x7f72032b6da6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 799 2018-02-14 14:55:45: @ 0x7f72005ced73  (unknown)
> 800 2018-02-14 14:55:45: @ 0x7f72000cf52c  (unknown)
> 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd  (unknown)
> 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, 
> code=killed, status=6/ABRT
> 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed 
> state.
> 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result 
> 'signal'.
> 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time 
> over, scheduling restart.
> 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel 
> agent.
> 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel 
> agent...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-8585) Agent Crashes When Asked to Start Task with Unknown User

2018-02-15 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365805#comment-16365805
 ] 

James Peach commented on MESOS-8585:


Yeah, crashing in this case seems pretty unfortunate. Probably 
`createExecutorDirectory` should return an error, and we should refactor the 
callers to be able to propagate that correctly.
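
A rough sketch of the shape of that change (hedged: the real 
`createExecutorDirectory` takes more parameters and builds the path itself; 
this only shows replacing the CHECK with an Error the callers could propagate, 
e.g. as a TASK_FAILED reason):

{code}
#include <string>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/try.hpp>

#include <stout/os/chown.hpp>
#include <stout/os/mkdir.hpp>

// Sketch: surface the chown failure (e.g. "No such user 'bad'") as an Error
// instead of CHECK_SOME(mkdir) aborting the whole agent.
Try<std::string> createExecutorDirectory(
    const std::string& directory,
    const std::string& user)
{
  Try<Nothing> mkdir = os::mkdir(directory);
  if (mkdir.isError()) {
    return Error("Failed to create executor directory: " + mkdir.error());
  }

  Try<Nothing> chown = os::chown(user, directory);
  if (chown.isError()) {
    return Error("Failed to chown directory to '" + user + "': " + chown.error());
  }

  return directory;
}
{code}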

> Agent Crashes When Asked to Start Task with Unknown User
> --
>
> Key: MESOS-8585
> URL: https://issues.apache.org/jira/browse/MESOS-8585
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Karsten
>Priority: Major
> Attachments: dcos-mesos-slave.service.1.gz, 
> dcos-mesos-slave.service.2.gz
>
>
> The Marathon team has an integration test that tries to start a task with an 
> unknown user. The test expects a \{{TASK_FAILED}}. However, we see 
> \{{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent 
> crashes and restarts.
>  
> {code}
>  783 2018-02-14 14:55:45: I0214 14:55:45.319974  6213 slave.cpp:2542] 
> Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for 
> framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001
> 784 2018-02-14 14:55:45: I0214 14:55:45.320605  6213 paths.cpp:727] 
> Creating sandbox 
> '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05
> 784 
> a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
>  for user 'bad'
> 785 2018-02-14 14:55:45: F0214 14:55:45.321131  6213 paths.cpp:735] 
> CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' 
> Failed to create executor directory '/var/lib/mesos/slave/
> 785 
> slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6
> 785 66d4acc88'
> 786 2018-02-14 14:55:45: *** Check failure stack trace: ***
> 787 2018-02-14 14:55:45: @ 0x7f72033444ad  
> google::LogMessage::Fail()
> 788 2018-02-14 14:55:45: @ 0x7f72033462dd  
> google::LogMessage::SendToLog()
> 789 2018-02-14 14:55:45: @ 0x7f720334409c  
> google::LogMessage::Flush()
> 790 2018-02-14 14:55:45: @ 0x7f7203346bd9  
> google::LogMessageFatal::~LogMessageFatal()
> 791 2018-02-14 14:55:45: @ 0x56544ca378f9  
> _CheckFatal::~_CheckFatal()
> 792 2018-02-14 14:55:45: @ 0x7f720270f30d  
> mesos::internal::slave::paths::createExecutorDirectory()
> 793 2018-02-14 14:55:45: @ 0x7f720273812c  
> mesos::internal::slave::Framework::addExecutor()
> 794 2018-02-14 14:55:45: @ 0x7f7202753e35  
> mesos::internal::slave::Slave::__run()
> 795 2018-02-14 14:55:45: @ 0x7f7202764292  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4
> 795 
> listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1
> 795 
> 7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_
> 796 2018-02-14 14:55:45: @ 0x7f72032a2b11  
> process::ProcessBase::consume()
> 797 2018-02-14 14:55:45: @ 0x7f72032b183c  
> process::ProcessManager::resume()
> 798 2018-02-14 14:55:45: @ 0x7f72032b6da6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 799 2018-02-14 14:55:45: @ 0x7f72005ced73  (unknown)
> 800 2018-02-14 14:55:45: @ 0x7f72000cf52c  (unknown)
> 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd  (unknown)
> 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, 
> code=killed, status=6/ABRT
> 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed 
> state.
> 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result 
> 'signal'.
> 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time 
> over, scheduling restart.
> 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel 
> agent.
> 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel 
> agent...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-8524) When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race

2018-02-15 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-8524:
---

Assignee: (was: Benjamin Bannier)

> When `UPDATE_SLAVE` messages are received, offers might not be rescinded due 
> to a race 
> ---
>
> Key: MESOS-8524
> URL: https://issues.apache.org/jira/browse/MESOS-8524
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Affects Versions: 1.5.0
> Environment: Master + Agent running with enabled 
> {{RESOURCE_PROVIDER}} capability
>Reporter: Jan Schlicht
>Priority: Major
>  Labels: mesosphere
>
> When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers 
> with the master, it sends an {{UPDATE_SLAVE}} message after being (re-)registered. In 
> the master, the agent is added (back) to the allocator as soon as it is 
> (re-)registered, i.e. before {{UPDATE_SLAVE}} is sent. This triggers an 
> allocation, and offers might get sent out to frameworks. When {{UPDATE_SLAVE}} 
> is handled in the master, these offers have to be rescinded, as they're 
> based on an outdated agent state.
> Internally, the allocator defers an offer callback in the master 
> ({{Master::offer}}). In rare cases an {{UPDATE_SLAVE}} message might arrive at 
> the same time and its handler in the master is called before the offer callback 
> (but after the actual allocation took place). In this case the (outdated) 
> offer is still sent to frameworks and never rescinded.
> Here's the relevant log lines, this was discovered while working on 
> https://reviews.apache.org/r/65045/:
> {noformat}
> I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 704915ns
> I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 
> (172.18.8.20) with total oversubscribed resources {}
> I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to 
> framework 53c557e7-3161-449b-bacc-a4f8c02e78e7- (default) at 
> scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
> I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 
> 40444ns
> I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } 
> (used)
> I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total 
> resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Issue Comment Deleted] (MESOS-8524) When `UPDATE_SLAVE` messages are received, offers might not be rescinded due to a race

2018-02-15 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-8524?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-8524:

Comment: was deleted

(was: Review: https://reviews.apache.org/r/65506/)

> When `UPDATE_SLAVE` messages are received, offers might not be rescinded due 
> to a race 
> ---
>
> Key: MESOS-8524
> URL: https://issues.apache.org/jira/browse/MESOS-8524
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Affects Versions: 1.5.0
> Environment: Master + Agent running with enabled 
> {{RESOURCE_PROVIDER}} capability
>Reporter: Jan Schlicht
>Assignee: Benjamin Bannier
>Priority: Major
>  Labels: mesosphere
>
> When an agent with the {{RESOURCE_PROVIDER}} capability enabled (re-)registers 
> with the master, it sends an {{UPDATE_SLAVE}} message after being (re-)registered. In 
> the master, the agent is added (back) to the allocator as soon as it is 
> (re-)registered, i.e. before {{UPDATE_SLAVE}} is sent. This triggers an 
> allocation, and offers might get sent out to frameworks. When {{UPDATE_SLAVE}} 
> is handled in the master, these offers have to be rescinded, as they're 
> based on an outdated agent state.
> Internally, the allocator defers an offer callback in the master 
> ({{Master::offer}}). In rare cases an {{UPDATE_SLAVE}} message might arrive at 
> the same time and its handler in the master is called before the offer callback 
> (but after the actual allocation took place). In this case the (outdated) 
> offer is still sent to frameworks and never rescinded.
> Here's the relevant log lines, this was discovered while working on 
> https://reviews.apache.org/r/65045/:
> {noformat}
> I0201 14:17:47.041093 242208768 hierarchical.cpp:1517] Performed allocation 
> for 1 agents in 704915ns
> I0201 14:17:47.041738 242745344 master.cpp:7235] Received update of agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 at slave(540)@172.18.8.20:60469 
> (172.18.8.20) with total oversubscribed resources {}
> I0201 14:17:47.042778 242745344 master.cpp:8808] Sending 1 offers to 
> framework 53c557e7-3161-449b-bacc-a4f8c02e78e7- (default) at 
> scheduler-798f476b-b099-443e-bd3b-9e7333f29672@172.18.8.20:60469
> I0201 14:17:47.043102 243281920 sched.cpp:921] Scheduler::resourceOffers took 
> 40444ns
> I0201 14:17:47.043427 243818496 hierarchical.cpp:712] Grew agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 by disk[MOUNT]:200 (total), {  } 
> (used)
> I0201 14:17:47.043643 243818496 hierarchical.cpp:669] Agent 
> 53c557e7-3161-449b-bacc-a4f8c02e78e7-S0 (172.18.8.20) updated with total 
> resources disk[MOUNT]:200; cpus:2; mem:1024; disk:1024; ports:[31000-32000]
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-2307) Dispatching to a non-existent Process should not return a pending future.

2018-02-15 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365778#comment-16365778
 ] 

Alexander Rukletsov commented on MESOS-2307:


Abandoned makes sense to me.

> Dispatching to a non-existent Process should not return a pending future.
> -
>
> Key: MESOS-2307
> URL: https://issues.apache.org/jira/browse/MESOS-2307
> Project: Mesos
>  Issue Type: Bug
>  Components: libprocess
>Reporter: Alexander Rukletsov
>Priority: Major
>
> If the libprocess process is terminated, we can still dispatch calls to it as 
> long as we have a {{UPID}}. In this case the future will be pending forever. 
> Instead, it would be better to introduce a separate state for such a case, e.g. 
> {{Disconnected}}, {{Abandoned}}.
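
A small illustration of the behavior described above (assuming a trivial 
user-defined Process; the returned future is never satisfied):

{code}
#include <process/dispatch.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

using process::Future;
using process::PID;
using process::Process;

class EchoProcess : public Process<EchoProcess>
{
public:
  int echo(int i) { return i; }
};

// Dispatching via a stale UPID after the process has terminated yields a
// Future that stays pending forever instead of transitioning to a state
// like the proposed `Abandoned`.
void example()
{
  EchoProcess echo;
  PID<EchoProcess> pid = process::spawn(&echo);

  process::terminate(pid);
  process::wait(pid);

  Future<int> future = process::dispatch(pid, &EchoProcess::echo, 42);
  // `future` remains pending indefinitely.
}
{code}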



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7273) HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaHTTPSWithContainerImage fails on some Linux machines.

2018-02-15 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7273?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7273:
--

Assignee: (was: haosdent)

> HealthCheckTest.ROOT_INTERNET_CURL_HealthyTaskViaHTTPSWithContainerImage 
> fails on some Linux machines.
> --
>
> Key: MESOS-7273
> URL: https://issues.apache.org/jira/browse/MESOS-7273
> Project: Mesos
>  Issue Type: Bug
> Environment: Ubuntu 16.04
>Reporter: Alexander Rukletsov
>Priority: Major
>  Labels: flaky-test, health-check, test, test-fail
> Attachments: 
> ROOT_INTERNET_CURL_HealthyTaskViaHTTPSWithContainerImage-badrun.txt, 
> ROOT_INTERNET_CURL_HealthyTaskViaHTTPWithContainerImage_failure_ubuntu16.04.txt
>
>
> Log of a bad run: http://pastebin.com/ENa5Sd62
> Brief investigation hints that the task executable failed to start, which 
> may or may not be related to the environment variable setup:
> {noformat}
> Overwriting environment variable 'PATH', original: 
> '/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin', new: 
> '/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin'
> Failed to execute command: No such file or directory
> {noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Assigned] (MESOS-7434) SlaveTest.RestartSlaveRequireExecutorAuthentication is flaky.

2018-02-15 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7434?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reassigned MESOS-7434:
--

Assignee: (was: Andrei Budnik)

> SlaveTest.RestartSlaveRequireExecutorAuthentication is flaky.
> -
>
> Key: MESOS-7434
> URL: https://issues.apache.org/jira/browse/MESOS-7434
> Project: Mesos
>  Issue Type: Bug
> Environment: Debian 8
> CentOS 6
> other Linux distros
>Reporter: Greg Mann
>Priority: Major
>  Labels: flaky, flaky-test, mesosphere
> Attachments: RestartSlaveRequireExecutorAuthentication is 
> flaky_failure_log_centos6.txt, 
> RestartSlaveRequireExecutorAuthentication_failure_log_debian8.txt, 
> SlaveTest.RestartSlaveRequireExecAuth-Ubuntu-16.txt
>
>
> This test failure has been observed on an internal CI system. It occurs on a 
> variety of Linux distributions. It seems that using {{cat}} as the task 
> command may be problematic; see attached log file 
> {{SlaveTest.RestartSlaveRequireExecutorAuthentication.txt}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (MESOS-7991) fatal, check failed !framework->recovered()

2018-02-15 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7991?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365532#comment-16365532
 ] 

Alexander Rukletsov commented on MESOS-7991:


Lowering priority to "Major" because the issue is apparently rare (we have only 
one instance so far) and not severe. Keeping it open because one internal 
invariant is apparently not an invariant and can break.

> fatal, check failed !framework->recovered()
> ---
>
> Key: MESOS-7991
> URL: https://issues.apache.org/jira/browse/MESOS-7991
> Project: Mesos
>  Issue Type: Bug
>Reporter: Jack Crawford
>Assignee: Alexander Rukletsov
>Priority: Critical
>  Labels: reliability
>
> mesos master crashed on what appears to be framework recovery
> mesos master version: 1.3.1
> mesos agent version: 1.3.1
> {code}
> W0920 14:58:54.756364 25452 master.cpp:7568] Task 
> 862181ec-dffb-4c03-8807-5fb4c4e9a907 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756369 25452 master.cpp:7568] Task 
> 9c21c48a-63ad-4d58-9e22-f720af19a644 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756376 25452 master.cpp:7568] Task 
> 05c451f8-c48a-47bd-a235-0ceb9b3f8d0c of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756381 25452 master.cpp:7568] Task 
> e8641b1f-f67f-42fe-821c-09e5a290fc60 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756386 25452 master.cpp:7568] Task 
> f838a03c-5cd4-47eb-8606-69b004d89808 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756392 25452 master.cpp:7568] Task 
> 685ca5da-fa24-494d-a806-06e03bbf00bd of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> W0920 14:58:54.756397 25452 master.cpp:7568] Task 
> 65ccf39b-5c46-4121-9fdd-21570e8068e6 of framework 
> 889aae9d-1aab-4268-ba42-9d5c2461d871 unknown to the agent 
> a498d458-bbca-426e-b076-b328f5b035da-S5225 at slave(1)
> @10.0.239.217:5051 (ip-10-0-239-217) during re-registration: reconciling with 
> the agent
> F0920 14:58:54.756404 25452 master.cpp:7601] Check failed: 
> !framework->recovered()
> *** Check failure stack trace: ***
> @ 0x7f7bf80087ed  google::LogMessage::Fail()
> @ 0x7f7bf800a5a0  google::LogMessage::SendToLog()
> @ 0x7f7bf80083d3  google::LogMessage::Flush()
> @ 0x7f7bf800afc9  google::LogMessageFatal::~LogMessageFatal()
> @ 0x7f7bf736fe7e  
> mesos::internal::master::Master::reconcileKnownSlave()
> @ 0x7f7bf739e612  mesos::internal::master::Master::_reregisterSlave()
> @ 0x7f7bf73a580e  
> _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal6master6MasterERKNS5_9SlaveInfoERKNS0_4UPIDERK6OptionINSt7__cxx1112basic_stringIcSt11char_traitsIcESaIc
> RKSt6vectorINS5_8ResourceESaISQ_EERKSP_INS5_12ExecutorInfoESaISV_EERKSP_INS5_4TaskESaIS10_EERKSP_INS5_13FrameworkInfoESaIS15_EERKSP_INS6_17Archive_FrameworkESaIS1A_EERKSL_RKSP_INS5_20SlaveInfo_CapabilityESaIS
> 1H_EERKNS0_6FutureIbEES9_SC_SM_SS_SX_S12_S17_S1C_SL_S1J_S1N_EEvRKNS0_3PIDIT_EEMS1R_FvT0_T1_T2_T3_T4_T5_T6_T7_T8_T9_T10_ET11_T12_T13_T14_T15_T16_T17_T18_T19_T20_T21_EUlS2_E_E9_M_invokeERKSt9_Any_dataOS2_
> @ 0x7f7bf7f5e69c  process::ProcessBase::visit()
> @ 0x7f7bf7f71403  process::ProcessManager::resume()
> @ 0x7f7bf7f7c127  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> @ 0x7f7bf60b5c80  (unknown)
> @ 0x7f7bf58c86ba  start_thread
> @ 0x7f7bf55fe3dd  (unknown)
> mesos-master.service: Main process exited, code=killed, status=6/ABRT
> mesos-master.service: Unit entered failed state.
> mesos-master.service: Failed with result 'signal'.
> {code}



--
This message was sent by Atlassian JIRA

[jira] [Commented] (MESOS-8585) Agent Crashes When Asked to Start Task with Unknown User

2018-02-15 Thread Jan Schlicht (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-8585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16365343#comment-16365343
 ] 

Jan Schlicht commented on MESOS-8585:
-

Looks like this has been introduced in https://reviews.apache.org/r/64630/.
cc [~jpe...@apache.org]

> Agent Crashes When Asked to Start Task with Unknown User
> --
>
> Key: MESOS-8585
> URL: https://issues.apache.org/jira/browse/MESOS-8585
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Affects Versions: 1.5.0
>Reporter: Karsten
>Priority: Major
> Attachments: dcos-mesos-slave.service.1.gz, 
> dcos-mesos-slave.service.2.gz
>
>
> The Marathon team has an integration test that tries to start a task with an 
> unknown user. The test expects a \{{TASK_FAILED}}. However, we see 
> \{{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent 
> crashes and restarts.
>  
> {code}
>  783 2018-02-14 14:55:45: I0214 14:55:45.319974  6213 slave.cpp:2542] 
> Launching task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for 
> framework 120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001
> 784 2018-02-14 14:55:45: I0214 14:55:45.320605  6213 paths.cpp:727] 
> Creating sandbox 
> '/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05
> 784 
> a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
>  for user 'bad'
> 785 2018-02-14 14:55:45: F0214 14:55:45.321131  6213 paths.cpp:735] 
> CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' 
> Failed to create executor directory '/var/lib/mesos/slave/
> 785 
> slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6
> 785 66d4acc88'
> 786 2018-02-14 14:55:45: *** Check failure stack trace: ***
> 787 2018-02-14 14:55:45: @ 0x7f72033444ad  
> google::LogMessage::Fail()
> 788 2018-02-14 14:55:45: @ 0x7f72033462dd  
> google::LogMessage::SendToLog()
> 789 2018-02-14 14:55:45: @ 0x7f720334409c  
> google::LogMessage::Flush()
> 790 2018-02-14 14:55:45: @ 0x7f7203346bd9  
> google::LogMessageFatal::~LogMessageFatal()
> 791 2018-02-14 14:55:45: @ 0x56544ca378f9  
> _CheckFatal::~_CheckFatal()
> 792 2018-02-14 14:55:45: @ 0x7f720270f30d  
> mesos::internal::slave::paths::createExecutorDirectory()
> 793 2018-02-14 14:55:45: @ 0x7f720273812c  
> mesos::internal::slave::Framework::addExecutor()
> 794 2018-02-14 14:55:45: @ 0x7f7202753e35  
> mesos::internal::slave::Slave::__run()
> 795 2018-02-14 14:55:45: @ 0x7f7202764292  
> _ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4
> 795 
> listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1
> 795 
> 7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_
> 796 2018-02-14 14:55:45: @ 0x7f72032a2b11  
> process::ProcessBase::consume()
> 797 2018-02-14 14:55:45: @ 0x7f72032b183c  
> process::ProcessManager::resume()
> 798 2018-02-14 14:55:45: @ 0x7f72032b6da6  
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
> 799 2018-02-14 14:55:45: @ 0x7f72005ced73  (unknown)
> 800 2018-02-14 14:55:45: @ 0x7f72000cf52c  (unknown)
> 801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd  (unknown)
> 802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, 
> code=killed, status=6/ABRT
> 803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed 
> state.
> 804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result 
> 'signal'.
> 805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time 
> over, scheduling restart.
> 806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel 
> agent.
> 807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel 
> agent...
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (MESOS-8585) Agent Crashes When Asked to Start Task with Unknown User

2018-02-15 Thread Karsten (JIRA)
Karsten created MESOS-8585:
--

 Summary: Agent Crashes When Asked to Start Task with Unknown User
 Key: MESOS-8585
 URL: https://issues.apache.org/jira/browse/MESOS-8585
 Project: Mesos
  Issue Type: Bug
  Components: agent
Affects Versions: 1.5.0
Reporter: Karsten


The Marathon team has an integration test that tries to start a task with an 
unknown user. The test expects a \{{TASK_FAILED}}. However, we see 
\{{TASK_DROPPED}} instead. The agent logs seem to suggest that the agent 
crashes and restarts.

 

{code}
 783 2018-02-14 14:55:45: I0214 14:55:45.319974  6213 slave.cpp:2542] Launching 
task 'sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6' for framework 
120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001
784 2018-02-14 14:55:45: I0214 14:55:45.320605  6213 paths.cpp:727] 
Creating sandbox 
'/var/lib/mesos/slave/slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05
784 
a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac666d4acc88'
 for user 'bad'
785 2018-02-14 14:55:45: F0214 14:55:45.321131  6213 paths.cpp:735] 
CHECK_SOME(mkdir): Failed to chown directory to 'bad': No such user 'bad' 
Failed to create executor directory '/var/lib/mesos/slave/
785 
slaves/120721e5-96e5-4c0b-8660-d5ba2e96f05a-S3/frameworks/120721e5-96e5-4c0b-8660-d5ba2e96f05a-0001/executors/sleep-bad-user-7.228ba17d-1197-11e8-baca-6a2835f12cb6/runs/dc99056a-1d85-427f-a34b-ac6
785 66d4acc88'
786 2018-02-14 14:55:45: *** Check failure stack trace: ***
787 2018-02-14 14:55:45: @ 0x7f72033444ad  
google::LogMessage::Fail()
788 2018-02-14 14:55:45: @ 0x7f72033462dd  
google::LogMessage::SendToLog()
789 2018-02-14 14:55:45: @ 0x7f720334409c  
google::LogMessage::Flush()
790 2018-02-14 14:55:45: @ 0x7f7203346bd9  
google::LogMessageFatal::~LogMessageFatal()
791 2018-02-14 14:55:45: @ 0x56544ca378f9  
_CheckFatal::~_CheckFatal()
792 2018-02-14 14:55:45: @ 0x7f720270f30d  
mesos::internal::slave::paths::createExecutorDirectory()
793 2018-02-14 14:55:45: @ 0x7f720273812c  
mesos::internal::slave::Framework::addExecutor()
794 2018-02-14 14:55:45: @ 0x7f7202753e35  
mesos::internal::slave::Slave::__run()
795 2018-02-14 14:55:45: @ 0x7f7202764292  
_ZNO6lambda12CallableOnceIFvPN7process11ProcessBaseEEE10CallableFnINS_8internal7PartialIZNS1_8dispatchIN5mesos8internal5slave5SlaveERKNS1_6FutureISt4
795 
listIbSaIbRKNSA_13FrameworkInfoERKNSA_12ExecutorInfoERK6OptionINSA_8TaskInfoEERKSR_INSA_13TaskGroupInfoEERKSt6vectorINSB_19ResourceVersionUUIDESaIS11_EESK_SN_SQ_SV_SZ_S15_EEvRKNS1_3PIDIT_EEMS1
795 
7_FvT0_T1_T2_T3_T4_T5_EOT6_OT7_OT8_OT9_OT10_OT11_EUlOSI_OSL_OSO_OST_OSX_OS13_S3_E_ISI_SL_SO_ST_SX_S13_St12_PlaceholderILi1EEclEOS3_
796 2018-02-14 14:55:45: @ 0x7f72032a2b11  
process::ProcessBase::consume()
797 2018-02-14 14:55:45: @ 0x7f72032b183c  
process::ProcessManager::resume()
798 2018-02-14 14:55:45: @ 0x7f72032b6da6  
_ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUlvE_vEEE6_M_runEv
799 2018-02-14 14:55:45: @ 0x7f72005ced73  (unknown)
800 2018-02-14 14:55:45: @ 0x7f72000cf52c  (unknown)
801 2018-02-14 14:55:45: @ 0x7f71ffe0d1dd  (unknown)
802 2018-02-14 14:57:15: dcos-mesos-slave.service: Main process exited, 
code=killed, status=6/ABRT
803 2018-02-14 14:57:15: dcos-mesos-slave.service: Unit entered failed 
state.
804 2018-02-14 14:57:15: dcos-mesos-slave.service: Failed with result 
'signal'.
805 2018-02-14 14:57:20: dcos-mesos-slave.service: Service hold-off time 
over, scheduling restart.
806 2018-02-14 14:57:20: Stopped Mesos Agent: distributed systems kernel 
agent.
807 2018-02-14 14:57:20: Starting Mesos Agent: distributed systems kernel 
agent...

{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)