[jira] [Commented] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-06-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337520#comment-15337520
 ] 

haosdent commented on MESOS-5188:
-

This doesn't look like an issue in 1.0.0, so let me remove the fix version. [~liqlin] 

> docker executor thinks task is failed when docker container was stopped
> ---
>
> Key: MESOS-5188
> URL: https://issues.apache.org/jira/browse/MESOS-5188
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
>
> Test cases:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER ID    IMAGE     COMMAND        CREATED          STATUS          PORTS    NAMES
> b4813ba3ed4d    ubuntu    "sleep 300"    9 seconds ago    Up 8 seconds             mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. Found the task is failed. See Mesos slave log,
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 
> for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for 
> framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
>  to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 
> of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for 
> executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received 
> within 75secs
> I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max 
> allowed age: 2.342613645432778days
> I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
> I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at 
> master@192.168.56.110:5050
> I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. 
> Attempting to register without authentication
> I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor 
> '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-'
> I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master 
> master@192.168.56.110:5050
> I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed 
> resources
> I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- pid to 
> scheduler(1)@192.168.56.110:53375
> I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources 
> from  to
> I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor 
> '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
> executor(1)@192.168.56.110:40725
> I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is 
> identical to existing resources
> I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task 
> '99ee7dc74861' to executor '99ee7dc74861' of framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- at executor(1)@192.168.56.110:40725
> I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update 
> TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 
> 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
> 

[jira] [Updated] (MESOS-5188) docker executor thinks task is failed when docker container was stopped

2016-06-17 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-5188:

Fix Version/s: (was: 1.0.0)

> docker executor thinks task is failed when docker container was stopped
> ---
>
> Key: MESOS-5188
> URL: https://issues.apache.org/jira/browse/MESOS-5188
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.28.0
>Reporter: Liqiang Lin
>
> Test cases:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER ID    IMAGE     COMMAND        CREATED          STATUS          PORTS    NAMES
> b4813ba3ed4d    ubuntu    "sleep 300"    9 seconds ago    Up 8 seconds             mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. Found the task is failed. See Mesos slave log,
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 
> for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for 
> framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
>  to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 
> of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources 
> cpus(*):0.1; mem(*):32 in work directory 
> '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for 
> executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received 
> within 75secs
> I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max 
> allowed age: 2.342613645432778days
> I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
> I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at 
> master@192.168.56.110:5050
> I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. 
> Attempting to register without authentication
> I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending 
> status updates
> I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor 
> '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-'
> I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master 
> master@192.168.56.110:5050
> I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed 
> resources
> I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- pid to 
> scheduler(1)@192.168.56.110:53375
> I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources 
> from  to
> I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending 
> status updates
> I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor 
> '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
> executor(1)@192.168.56.110:40725
> I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container 
> '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is 
> identical to existing resources
> I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task 
> '99ee7dc74861' to executor '99ee7dc74861' of framework 
> 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- at executor(1)@192.168.56.110:40725
> I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update 
> TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 
> 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from 
> executor(1)@192.168.56.110:40725
> I0407 09:12:21.308820 32308 status_update_manager.cpp:320] Received status 
> 

[jira] [Updated] (MESOS-5641) Update docker-volume.md to add some content for how to test

2016-06-17 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5641:
--
Issue Type: Task  (was: Bug)

> Update docker-volume.md to add some content for how to test
> ---
>
> Key: MESOS-5641
> URL: https://issues.apache.org/jira/browse/MESOS-5641
> Project: Mesos
>  Issue Type: Task
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Fix For: 1.0.0
>
>
> mesos-execute was fixed in MESOS-5265; the documentation should be updated to 
> explain how to use mesos-execute to test the docker volume isolator feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5641) Update docker-volume.md to add some content for how to test

2016-06-17 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5641:
--

 Summary: Update docker-volume.md to add some content for how to 
test
 Key: MESOS-5641
 URL: https://issues.apache.org/jira/browse/MESOS-5641
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


mesos-execute was fixed in MESOS-5265; the documentation should be updated to 
explain how to use mesos-execute to test the docker volume isolator feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5637) Authorized endpoint results are inconsistent for failures.

2016-06-17 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337419#comment-15337419
 ] 

Till Toenshoff commented on MESOS-5637:
---

To unify this, we need to decide on:
 - the HTTP status code we actually want to show our users
 - whether we want to display the future's error message in the HTTP body

Furthermore, we might want to introduce tests that prevent regressions from 
reintroducing such inconsistencies in the future.
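
To make the options concrete, here is a minimal, purely illustrative sketch (the 
{{handle}} helper and its parameters are invented for this example and are not 
the actual master/http.cpp code) of one way to unify this with libprocess: route 
every endpoint's authorization future through the same {{repair}} step, so a 
failed future always yields one status code, optionally with the failure message 
in the body.

{code}
#include <functional>

#include <process/future.hpp>
#include <process/http.hpp>

using process::Future;
namespace http = process::http;

// Illustrative helper (hypothetical): run `handler` only if authorization
// succeeded, and map any authorizer failure to one consistent response.
Future<http::Response> handle(
    const Future<bool>& authorized,
    const std::function<Future<http::Response>()>& handler)
{
  return authorized
    .then([handler](bool granted) -> Future<http::Response> {
      if (!granted) {
        return http::Forbidden();
      }
      return handler();
    })
    .repair([](const Future<http::Response>& failed)
                -> Future<http::Response> {
      // One possible unified choice: 500 plus the future's error message.
      return http::InternalServerError(failed.failure());
    });
}
{code}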

> Authorized endpoint results are inconsistent for failures.
> --
>
> Key: MESOS-5637
> URL: https://issues.apache.org/jira/browse/MESOS-5637
> Project: Mesos
>  Issue Type: Bug
>  Components: master, modules
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>  Labels: authorization, mesosphere, security
>
> When trying to access authorized endpoints, the resulting HTTP status codes 
> are not consistent for internal authorizer failures (failed future returned 
> by {{authorized}}).
> {{/flags}}: 
> {noformat}
> HTTP/1.1 503 Service Unavailable
> Date: Fri, 17 Jun 2016 23:11:04 GMT
> Content-Length: 0
> {noformat}
> {{/state}}:
> {noformat}
> HTTP/1.1 500 Internal Server Error
> Date: Fri, 17 Jun 2016 23:08:49 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: size($FUTURE_ERROR_MESSAGE)
> $FUTURE_ERROR_MESSAGE
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5576) Masters may drop the first message they send between masters after a network partition

2016-06-17 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-5576:
-
Issue Type: Improvement  (was: Bug)

Changing type from {{Bug}} to {{Improvement}} because the masters will still 
recover *eventually* in this case.  Bad sockets are cleaned out when the 
masters abort due to {{--registry_fetch_timeout}}.

> Masters may drop the first message they send between masters after a network 
> partition
> --
>
> Key: MESOS-5576
> URL: https://issues.apache.org/jira/browse/MESOS-5576
> Project: Mesos
>  Issue Type: Improvement
>  Components: leader election, master, replicated log
>Affects Versions: 0.28.2
> Environment: Observed in an OpenStack environment where each master 
> lives on a separate VM.
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
> Master 2 sends a series of messages to the recently-restarted Master 5.  The 
> first message is dropped, but subsequent messages are not dropped.
> This appears to be due to a stale link between the masters.  Before leader 
> election, the replicated log actors create a network watcher, which adds 
> links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, 
> perhaps due to how the network partition was induced (in the hypervisor 
> layer, rather than in the VM itself).
> When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not 
> observe the [expected log 
> message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494].
> Instead, we see this log line on Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is 
> not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}} and the 
> following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new 
> socket.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5640) Unify the help info for master/agent flags

2016-06-17 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5640:
--

 Summary: Unify the help info for master/agent flags
 Key: MESOS-5640
 URL: https://issues.apache.org/jira/browse/MESOS-5640
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Priority: Minor


Currently, in master/flags.cpp, some flag help strings end with a "\n" while 
others do not, which makes the output inconsistent.

{code}
--[no-]hostname_lookup 
Whether we should execute a lookup to find out the server's hostname,

 if not explicitly set (via, e.g., `--hostname`).

 True by default; if set to `false` it will cause Mesos

 to use the IP address, unless the hostname is explicitly set. (default: true)
  --http_authenticators=VALUE   
 HTTP authenticator implementation to use when handling requests to

 authenticated endpoints. Use the default

 `basic`, or load an alternate

 HTTP authenticator module using `--modules`.


 Currently there is no support for multiple HTTP authenticators. (default: 
basic)
  --http_framework_authenticators=VALUE 
 HTTP authenticator implementation to use when authenticating HTTP

 frameworks. Use the

 `basic` authenticator or load an

 alternate authenticator module using `--modules`.

 Must be used in conjunction with `--http_authenticate_frameworks`.
{code}

I think we should follow the Linux man page format by adding a "\n" to all 
flags; a sketch of what this could look like follows the sample below.

The following is a sample output for "man ls".

{code}
 -@  Display extended attribute keys and sizes in long (-l) output.

 -1  (The numeric digit ``one''.)  Force output to be one entry per 
line.  This is the default when output is not to a terminal.

 -A  List all entries except for . and ...  Always set for the 
super-user.

 -a  Include directory entries whose names begin with a dot (.).

 -B  Force printing of non-printable characters (as defined by ctype(3) 
and current locale settings) in file names as \xxx, where xxx is the numeric 
value of the character
 in octal.

 -b  As -B, but use C escape codes whenever possible.
{code}
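
As a purely illustrative sketch of that convention (assuming the stout flags API 
that master/flags.cpp already uses; the class and the single flag here are 
examples, not a patch), every help string would simply end with a trailing "\n":

{code}
#include <stout/flags.hpp>

// Example flags class; ending each help text with "\n" gives every flag its
// own man-page-style paragraph in the --help output, regardless of which
// flag happens to precede it.
class ExampleFlags : public virtual flags::FlagsBase
{
public:
  ExampleFlags()
  {
    add(&ExampleFlags::hostname_lookup,
        "hostname_lookup",
        "Whether we should execute a lookup to find out the server's\n"
        "hostname, if not explicitly set (via, e.g., `--hostname`).\n",
        true);
  }

  bool hostname_lookup;
};
{code}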



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5639) Add documentation about metadata for CNI plugins.

2016-06-17 Thread Jie Yu (JIRA)
Jie Yu created MESOS-5639:
-

 Summary: Add documentation about metadata for CNI plugins.
 Key: MESOS-5639
 URL: https://issues.apache.org/jira/browse/MESOS-5639
 Project: Mesos
  Issue Type: Task
Reporter: Jie Yu
Assignee: Jie Yu


We need to document the behavior implemented in MESOS-5592.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection

2016-06-17 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5635:
-
Description: 
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
After one such failure, the agent recovered and about a minute later the 
following was observed in the master logs:
{code}
I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179)
{code}
However, we see nothing about registration in the agent logs at this time. 
Subsequently, in the master logs, we see the agent continuing to reregister 
every couple seconds:
{code}
I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179)
{code}
After about four minutes of this, we see:
{code}
I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
{code}
And after this point, we see repeated reregistration attempts from that agent 
in the master logs:
{code}
W0617 22:29:09.514423  2010 master.cpp:4773] Agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179) attempted to re-register after removal;
{code}

During all of this, however, the agent logs indicate nothing about 
registration. All we see are requests coming in to {{/state}}:
{code}
Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with 
User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326   875 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with 
User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
{code}

The lack of logging on the agent side and the health check timeout suggest a 
one-way disconnection such that the master cannot send messages to the agent, 
but the agent can send messages to the master. This behavior has been observed 
several times on this test cluster in the past couple of days. Full master and 
agent logs from the relevant time period have been attached.

  was:
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
After one such failure, the agent recovered and about a minute later the 
following was observed in the master logs:
{code}
I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179)
{code}
However, we see nothing about registration in the agent logs at this time. 
Subsequently, in the master logs, we see the agent continuing to reregister 
every couple seconds:
{code}
I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179)
{code}
After about four minutes of this, we see:
{code}
I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
{code}
And after this point, we see repeated reregistration attempts from that agent 
in the master logs:
{code}
W0617 22:29:09.514423  2010 master.cpp:4773] Agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179) attempted to re-register after removal;
{code}

During all of this, however, the agent logs indicate nothing about 
registration. All we see are requests coming in to {{/state}}:
{code}
Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with 
User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 
http.cpp:192] HTTP GET for 

[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection

2016-06-17 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5635:
-
Attachment: master-log.txt
agent-log.txt

> Agent repeatedly reregisters, possible one-way disconnection
> 
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>  Labels: agent, mesosphere
> Attachments: agent-log.txt, master-log.txt
>
>
> This issue was observed recently on an internal test cluster. Due to a bug in 
> the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
> After one such failure, the agent recovered and about a minute later the 
> following was observed in the master logs:
> {code}
> I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
> (10.10.0.179)
> {code}
> However, we see nothing about registration in the agent logs at this time. 
> Subsequently, in the master logs, we see the agent continuing to reregister 
> every couple seconds:
> {code}
> I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
> (10.10.0.179)
> {code}
> After about four minutes of this, we see:
> {code}
> I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
> {code}
> And after this point, we see repeated reregistration attempts from that agent 
> in the master logs:
> {code}
> W0617 22:29:09.514423  2010 master.cpp:4773] Agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
> (10.10.0.179) attempted to re-register after removal;
> {code}
> During all of this, however, the agent logs indicate nothing about 
> registration. All we see are requests coming in to {{/state}}:
> {code}
> Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with 
> User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
> Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326   875 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with 
> User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
> Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465   873 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
> {code}
> The lack of logging on the agent side and the health check timeout suggest 
> a one-way disconnection such that the master cannot send messages to the 
> agent, but the agent can send messages to the master. This behavior has been 
> observed several times on this test cluster in the past couple of days.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5638) Check all omissions of 'defer' for safety

2016-06-17 Thread Greg Mann (JIRA)
Greg Mann created MESOS-5638:


 Summary: Check all omissions of 'defer' for safety
 Key: MESOS-5638
 URL: https://issues.apache.org/jira/browse/MESOS-5638
 Project: Mesos
  Issue Type: Bug
Reporter: Greg Mann


When registering callbacks with {{.then}}, {{.onAny}}, etc., we sometimes omit 
{{defer()}} in cases where the callback is deemed threadsafe when run 
synchronously at an arbitrary callsite. Because of recent bugs due to the 
unsafe omission of {{defer()}}, we should do a sweep of the codebase for all 
such occurrences and evaluate their safety.
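
As a minimal sketch of the distinction (the class and members below are 
illustrative, not code from the tree), compare the two registration styles: 
without {{defer}} the continuation may run synchronously on whichever thread 
satisfies the future, while with {{defer}} it is dispatched back onto the owning 
process and serialized with its other handlers.

{code}
#include <process/defer.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

#include <stout/nothing.hpp>

using process::Future;
using process::Process;
using process::defer;

class MyProcess : public Process<MyProcess>
{
public:
  void watch(Future<int> future)
  {
    // Without defer: runs at an arbitrary callsite/thread when the future
    // transitions; only safe if the callback is genuinely thread-safe.
    future.then([this](int value) {
      counter += value;  // potential data race with other actor state
      return Nothing();
    });

    // With defer: the continuation is dispatched onto this process, so it
    // is serialized with every other dispatch to MyProcess.
    future.then(defer(self(), [this](int value) {
      counter += value;
      return Nothing();
    }));
  }

private:
  int counter = 0;
};
{code}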



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection

2016-06-17 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5635:
-
Description: 
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
After one such failure, the agent recovered and about a minute later the 
following was observed in the master logs:
{code}
I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179)
{code}
However, we see nothing about registration in the agent logs at this time. 
Subsequently, in the master logs, we see the agent continuing to reregister 
every couple seconds:
{code}
I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179)
{code}
After about four minutes of this, we see:
{code}
I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
{code}
And after this point, we see repeated reregistration attempts from that agent 
in the master logs:
{code}
W0617 22:29:09.514423  2010 master.cpp:4773] Agent 
6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
(10.10.0.179) attempted to re-register after removal;
{code}

During all of this, however, the agent logs indicate nothing about 
registration. All we see are requests coming in to {{/state}}:
{code}
Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with 
User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326   875 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with 
User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465   873 
http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
{code}

The lack of logging on the agent side and the health check timeout suggest a 
one-way disconnection such that the master cannot send messages to the agent, 
but the agent can send messages to the master. This behavior has been observed 
several times on this test cluster in the past couple of days.

  was:
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
While the agent was recovering from one of these failures, it segfaulted again. 
After this time, we noticed that after beginning recovery, the agent did not 
print {{Finished recovery}}, and its logs did not show any indication of 
reregistering with the master. Looking at the master's logs, however, the 
following line was observed repeatedly, at intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
(10.10.0.87) attempted to re-register after removal; shutting it down
{code}
These re-registration attempts had no corresponding lines in the agent log.

Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
it led to a successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}


> Agent repeatedly reregisters, possible one-way disconnection
> 
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>  Labels: agent, mesosphere
>
> This issue was observed recently on an internal test cluster. Due to a bug in 
> the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
> After one such failure, the agent recovered and about a minute later the 
> following was observed in the master logs:
> {code}
> I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 
> (10.10.0.179)
> {code}
> However, we see nothing about registration in the agent logs at this time. 
> Subsequently, in the master logs, we see the agent 

[jira] [Updated] (MESOS-5637) Authorized endpoint results are inconsistent for failures.

2016-06-17 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5637:
--
Affects Version/s: 1.0.0

> Authorized endpoint results are inconsistent for failures.
> --
>
> Key: MESOS-5637
> URL: https://issues.apache.org/jira/browse/MESOS-5637
> Project: Mesos
>  Issue Type: Bug
>  Components: master, modules
>Affects Versions: 1.0.0
>Reporter: Till Toenshoff
>  Labels: authorization, mesosphere, security
>
> When trying to access authorized endpoints, the resulting HTTP status codes 
> are not consistent for internal authorizer failures (failed future returned 
> by {{authorized}}).
> {{/flags}}: 
> {noformat}
> HTTP/1.1 503 Service Unavailable
> Date: Fri, 17 Jun 2016 23:11:04 GMT
> Content-Length: 0
> {noformat}
> {{/state}}:
> {noformat}
> HTTP/1.1 500 Internal Server Error
> Date: Fri, 17 Jun 2016 23:08:49 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: size($FUTURE_ERROR_MESSAGE)
> $FUTURE_ERROR_MESSAGE
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5533) Agent fails to start on CentOS 6 due to missing cgroup hierarchy

2016-06-17 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337329#comment-15337329
 ] 

Gilbert Song commented on MESOS-5533:
-

[~avin...@mesosphere.io], I guess we have some info mismatch, my bad. I have 
patches for the test failures below on CentOS 6:

`CniIsolatorTest.ROOT_INTERNET_CURL_LaunchCommandTask`
`CniIsolatorTest.ROOT_VerifyCheckpointedInfo`
`CniIsolatorTest.ROOT_SlaveRecovery`

But this should be a different issue. It seems like it is just a check, so it 
should be a quick fix. Do you want to take it over? Or I can do that.

> Agent fails to start on CentOS 6 due to missing cgroup hierarchy
> ---
>
> Key: MESOS-5533
> URL: https://issues.apache.org/jira/browse/MESOS-5533
> Project: Mesos
>  Issue Type: Bug
>  Components: build, isolation
>Reporter: Kapil Arya
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> With the network CNI isolator, the agent now _requires_ cgroups to be installed 
> on the system. Can we add some check(s) to either automatically disable the CNI 
> module if cgroup hierarchies are not available, or ask the user to 
> install/enable cgroup hierarchies?
> On CentOS 6, cgroup tools aren't installed by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5637) Authorized endpoint results are inconsistent for failures.

2016-06-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5637:
---
Priority: Major  (was: Minor)

> Authorized endpoint results are inconsistent for failures.
> --
>
> Key: MESOS-5637
> URL: https://issues.apache.org/jira/browse/MESOS-5637
> Project: Mesos
>  Issue Type: Bug
>  Components: master, modules
>Reporter: Till Toenshoff
>  Labels: authorization, mesosphere, security
>
> When trying to access authorized endpoints, the resulting HTTP status codes 
> are not consistent for internal authorizer failures (failed future returned 
> by {{authorized}}).
> {{/flags}}: 
> {noformat}
> HTTP/1.1 503 Service Unavailable
> Date: Fri, 17 Jun 2016 23:11:04 GMT
> Content-Length: 0
> {noformat}
> {{/state}}:
> {noformat}
> HTTP/1.1 500 Internal Server Error
> Date: Fri, 17 Jun 2016 23:08:49 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: size($FUTURE_ERROR_MESSAGE)
> $FUTURE_ERROR_MESSAGE
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5637) Authorized endpoint results are inconsistent for failures.

2016-06-17 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-5637:
---
Labels: authorization mesosphere security  (was: authorization security)

> Authorized endpoint results are inconsistent for failures.
> --
>
> Key: MESOS-5637
> URL: https://issues.apache.org/jira/browse/MESOS-5637
> Project: Mesos
>  Issue Type: Bug
>  Components: master, modules
>Reporter: Till Toenshoff
>Priority: Minor
>  Labels: authorization, mesosphere, security
>
> When trying to access authorized endpoints, the resulting HTTP status codes 
> are not consistent for internal authorizer failures (failed future returned 
> by {{authorized}}).
> {{/flags}}: 
> {noformat}
> HTTP/1.1 503 Service Unavailable
> Date: Fri, 17 Jun 2016 23:11:04 GMT
> Content-Length: 0
> {noformat}
> {{/state}}:
> {noformat}
> HTTP/1.1 500 Internal Server Error
> Date: Fri, 17 Jun 2016 23:08:49 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: size($FUTURE_ERROR_MESSAGE)
> $FUTURE_ERROR_MESSAGE
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5637) Authorized endpoint results are inconsistent for failures.

2016-06-17 Thread Till Toenshoff (JIRA)
Till Toenshoff created MESOS-5637:
-

 Summary: Authorized endpoint results are inconsistent for failures.
 Key: MESOS-5637
 URL: https://issues.apache.org/jira/browse/MESOS-5637
 Project: Mesos
  Issue Type: Bug
  Components: master, modules
Reporter: Till Toenshoff
Priority: Minor


When trying to access authorized endpoints, the resulting HTTP status codes are 
not consistent for internal authorizer failures (failed future returned by 
{{authorized}}).

{{/flags}}: 
{noformat}
HTTP/1.1 503 Service Unavailable
Date: Fri, 17 Jun 2016 23:11:04 GMT
Content-Length: 0
{noformat}

{{/state}}:
{noformat}
HTTP/1.1 500 Internal Server Error
Date: Fri, 17 Jun 2016 23:08:49 GMT
Content-Type: text/plain; charset=utf-8
Content-Length: size($FUTURE_ERROR_MESSAGE)

$FUTURE_ERROR_MESSAGE
{noformat}




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5592) Pass NetworkInfo to CNI Plugins

2016-06-17 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5592:
--
  Sprint: Mesosphere Sprint 37
Story Points: 3
  Labels: mesosphere  (was: )

> Pass NetworkInfo to CNI Plugins
> ---
>
> Key: MESOS-5592
> URL: https://issues.apache.org/jira/browse/MESOS-5592
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Dan Osborne
>Assignee: Dan Osborne
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> Mesos has adopted the Container Network Interface as a simple means of 
> networking Mesos tasks launched by the Unified Containerizer. The CNI 
> specification covers a minimum feature set, granting the flexibility to add 
> customized networking functionality in the form of agreements made between 
> the orchestrator and CNI plugin.
> This proposal is to pass NetworkInfo.Labels to the CNI plugin by injecting them 
> into the CNI network configuration JSON during plugin invocation.
> Design Doc on this change: 
> https://docs.google.com/document/d/1rxruCCcJqpppsQxQrzTbHFVnnW6CgQ2oTieYAmwL284/edit?usp=sharing
> reviewboard: https://reviews.apache.org/r/48527/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5533) Agent fails to start on CentOS 6 due to missing cgroup hierarchy

2016-06-17 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337273#comment-15337273
 ] 

Avinash Sridharan commented on MESOS-5533:
--

I think [~gilbert] had a patch; I'm not sure whether it went up for review.

> Agent fails to start on CentOS 6 due to missing cgroup hierarchy
> ---
>
> Key: MESOS-5533
> URL: https://issues.apache.org/jira/browse/MESOS-5533
> Project: Mesos
>  Issue Type: Bug
>  Components: build, isolation
>Reporter: Kapil Arya
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> With the network CNI isolator, the agent now _requires_ cgroups to be installed 
> on the system. Can we add some check(s) to either automatically disable the CNI 
> module if cgroup hierarchies are not available, or ask the user to 
> install/enable cgroup hierarchies?
> On CentOS 6, cgroup tools aren't installed by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5533) Agent fails to start on CentOS 6 due to missing cgroup hierarchy

2016-06-17 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337255#comment-15337255
 ] 

Vinod Kone commented on MESOS-5533:
---

What's the status of this?

> Agent fails to start on CentOS 6 due to missing cgroup hierarchy
> ---
>
> Key: MESOS-5533
> URL: https://issues.apache.org/jira/browse/MESOS-5533
> Project: Mesos
>  Issue Type: Bug
>  Components: build, isolation
>Reporter: Kapil Arya
>Assignee: Gilbert Song
>Priority: Blocker
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> With the network CNI isolator, the agent now _requires_ cgroups to be installed 
> on the system. Can we add some check(s) to either automatically disable the CNI 
> module if cgroup hierarchies are not available, or ask the user to 
> install/enable cgroup hierarchies?
> On CentOS 6, cgroup tools aren't installed by default.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection

2016-06-17 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5635:
-
Summary: Agent repeatedly reregisters, possible one-way disconnection  
(was: Agent repeatedly reregisters, possible one-way partition)

> Agent repeatedly reregisters, possible one-way disconnection
> 
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>  Labels: agent, mesosphere
>
> This issue was observed recently on an internal test cluster. Due to a bug in 
> the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
> While the agent was recovering from one of these failures, it segfaulted 
> again. After this time, we noticed that after beginning recovery, the agent 
> did not print {{Finished recovery}}, and its logs did not show any indication 
> of reregistering with the master. Looking at the master's logs, however, the 
> following line was observed repeatedly, at intervals on the order of seconds:
> {code}
> W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
> (10.10.0.87) attempted to re-register after removal; shutting it down
> {code}
> These re-registration attempts had no corresponding lines in the agent log.
> Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
> it led to a successful registration with a new agent ID:
> {code}
> I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
> slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way partition

2016-06-17 Thread Greg Mann (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Greg Mann updated MESOS-5635:
-
Summary: Agent repeatedly reregisters, possible one-way partition  (was: 
Agent failure during recovery prevents reregistration)

> Agent repeatedly reregisters, possible one-way partition
> 
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>  Labels: agent, mesosphere
>
> This issue was observed recently on an internal test cluster. Due to a bug in 
> the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
> While the agent was recovering from one of these failures, it segfaulted 
> again. After this time, we noticed that after beginning recovery, the agent 
> did not print {{Finished recovery}}, and its logs did not show any indication 
> of reregistering with the master. Looking at the master's logs, however, the 
> following line was observed repeatedly, at intervals on the order of seconds:
> {code}
> W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
> (10.10.0.87) attempted to re-register after removal; shutting it down
> {code}
> These re-registration attempts had no corresponding lines in the agent log.
> Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
> it led to a successful registration with a new agent ID:
> {code}
> I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
> slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5636) Display allocated resources in the agent listing of the webui.

2016-06-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5636:
---
Description: State endpoint returns information about allocated resources 
for each agent. We can present this information in the agent listing.  (was: 
State endpoint returns information about slaves used resources. Present this 
data in agents page.)

> Display allocated resources in the agent listing of the webui.
> --
>
> Key: MESOS-5636
> URL: https://issues.apache.org/jira/browse/MESOS-5636
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Tomasz Janiszewski
>Assignee: Tomasz Janiszewski
>Priority: Trivial
> Fix For: 1.0.0
>
> Attachments: mesos_agents_webui.png
>
>
> State endpoint returns information about allocated resources for each agent. 
> We can present this information in the agent listing.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5636) Display allocated resources in the agent listing of the webui.

2016-06-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5636:
---
Summary: Display allocated resources in the agent listing of the webui.  
(was: Support displaying allocated resources of Agents in Mesos webui)

> Display allocated resources in the agent listing of the webui.
> --
>
> Key: MESOS-5636
> URL: https://issues.apache.org/jira/browse/MESOS-5636
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Tomasz Janiszewski
>Assignee: Tomasz Janiszewski
>Priority: Trivial
> Fix For: 1.0.0
>
> Attachments: mesos_agents_webui.png
>
>
> The state endpoint returns information about each agent's used resources. 
> Present this data on the agents page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5636) Display allocated resources in the agent listing of the webui.

2016-06-17 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-5636:
---
Shepherd: Benjamin Mahler

> Display allocated resources in the agent listing of the webui.
> --
>
> Key: MESOS-5636
> URL: https://issues.apache.org/jira/browse/MESOS-5636
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Tomasz Janiszewski
>Assignee: Tomasz Janiszewski
>Priority: Trivial
> Fix For: 1.0.0
>
> Attachments: mesos_agents_webui.png
>
>
> The state endpoint returns information about each agent's used resources. 
> Present this data on the agents page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5636) Support displaying allocated resources of Agents in Mesos webui

2016-06-17 Thread Tomasz Janiszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Janiszewski updated MESOS-5636:
--
Summary: Support displaying allocated resources of Agents in Mesos webui  
(was: Support displaying used resources of Agents in Mesos webui)

> Support displaying allocated resources of Agents in Mesos webui
> ---
>
> Key: MESOS-5636
> URL: https://issues.apache.org/jira/browse/MESOS-5636
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Tomasz Janiszewski
>Assignee: Tomasz Janiszewski
>Priority: Trivial
> Attachments: mesos_agents_webui.png
>
>
> The state endpoint returns information about each agent's used resources. 
> Present this data on the agents page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5636) Support displaying used resources of Agents in Mesos webui

2016-06-17 Thread Tomasz Janiszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Janiszewski updated MESOS-5636:
--
Attachment: mesos_agents_webui.png

> Support displaying used resources of Agents in Mesos webui
> --
>
> Key: MESOS-5636
> URL: https://issues.apache.org/jira/browse/MESOS-5636
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Tomasz Janiszewski
>Priority: Trivial
> Attachments: mesos_agents_webui.png
>
>
> The state endpoint returns information about each agent's used resources. 
> Present this data on the agents page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5633) User related shell environment is not set correctly in tasks

2016-06-17 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15337097#comment-15337097
 ] 

Jie Yu commented on MESOS-5633:
---

Remember that a Mesos task should always write to its own sandbox 
($MESOS_SANDBOX). I am wondering if it makes sense to set $HOME to 
$MESOS_SANDBOX. I am not sure whether it would break something. Is there a 
standard specifying how $HOME should be set?
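
Purely as a sketch of that idea (illustrative only; the helper and where it 
would be called from are made up, not the actual launch path), pointing the 
user-related variables at the task's sandbox and user before exec'ing could look 
roughly like:

{code}
#include <string>

#include <stout/os.hpp>

// Hypothetical helper, called in the child before exec'ing the task command.
void setupTaskEnvironment(const std::string& sandbox, const std::string& user)
{
  os::setenv("HOME", sandbox);  // e.g. the value of $MESOS_SANDBOX
  os::setenv("USER", user);     // the task's user instead of root
}
{code}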

> User related shell environment is not set correctly in tasks
> 
>
> Key: MESOS-5633
> URL: https://issues.apache.org/jira/browse/MESOS-5633
> Project: Mesos
>  Issue Type: Bug
>Reporter: haosdent
>
> If a user specifies the user field in {{FrameworkInfo}} or {{Task}}, both 
> {{setuid}} and {{setgroups}} are set correctly. However, some user-related 
> shell variables, e.g., {{HOME}} and {{USER}}, still refer to root.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5265) Update mesos-execute to support docker volume isolator.

2016-06-17 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5265:
--
  Sprint: Mesosphere Sprint 37
Story Points: 3

> Update mesos-execute to support docker volume isolator.
> ---
>
> Key: MESOS-5265
> URL: https://issues.apache.org/jira/browse/MESOS-5265
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> mesos-execute needs to be updated to support the docker volume isolator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5635) Agent failure during recovery prevents reregistration

2016-06-17 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-5635:
--
Description: 
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
While the agent was recovering from one of these failures, it segfaulted again. 
After this time, we noticed that after beginning recovery, the agent did not 
print {{Finished recovery}}, and its logs did not show any indication of 
reregistering with the master. Looking at the master's logs, however, the 
following line was observed repeatedly, at intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
(10.10.0.87) attempted to re-register after removal; shutting it down
{code}
These re-registration attempts had no corresponding lines in the agent log.

Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
it led to a successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}

  was:
This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
While the agent was recovering from one of these failures, it segfaulted again. 
After this time, we noticed that after recovery, the agent did not print 
{{Finished recovery}}, and its logs did not show any indication of 
reregistering with the master. Looking at the master's logs, however, the 
following line was observed repeatedly, at intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
(10.10.0.87) attempted to re-register after removal; shutting it down
{code}
These re-registration attempts had no corresponding lines in the agent log.

Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
it led to a successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}


> Agent failure during recovery prevents reregistration
> -
>
> Key: MESOS-5635
> URL: https://issues.apache.org/jira/browse/MESOS-5635
> Project: Mesos
>  Issue Type: Bug
>Reporter: Greg Mann
>  Labels: agent, mesosphere
>
> This issue was observed recently on an internal test cluster. Due to a bug in 
> the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
> While the agent was recovering from one of these failures, it segfaulted 
> again. After this time, we noticed that after beginning recovery, the agent 
> did not print {{Finished recovery}}, and its logs did not show any indication 
> of reregistering with the master. Looking at the master's logs, however, the 
> following line was observed repeatedly, at intervals on the order of seconds:
> {code}
> W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
> (10.10.0.87) attempted to re-register after removal; shutting it down
> {code}
> These re-registration attempts had no corresponding lines in the agent log.
> Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
> it led to a successful registration with a new agent ID:
> {code}
> I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
> slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
> 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5636) Support displaying used resources of Agents in Mesos webui

2016-06-17 Thread Tomasz Janiszewski (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tomasz Janiszewski updated MESOS-5636:
--
Attachment: mesos_agents_webui.png

> Support displaying used resources of Agents in Mesos webui
> --
>
> Key: MESOS-5636
> URL: https://issues.apache.org/jira/browse/MESOS-5636
> Project: Mesos
>  Issue Type: Improvement
>  Components: webui
>Reporter: Tomasz Janiszewski
>Priority: Trivial
> Attachments: mesos_agents_webui.png
>
>
> The state endpoint returns information about agents' used resources. Present 
> this data on the agents page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5636) Support displaying used resources of Agents in Mesos webui

2016-06-17 Thread Tomasz Janiszewski (JIRA)
Tomasz Janiszewski created MESOS-5636:
-

 Summary: Support displaying used resources of Agents in Mesos webui
 Key: MESOS-5636
 URL: https://issues.apache.org/jira/browse/MESOS-5636
 Project: Mesos
  Issue Type: Improvement
  Components: webui
Reporter: Tomasz Janiszewski
Priority: Trivial


The state endpoint returns information about agents' used resources. Present 
this data on the agents page.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5635) Agent failure during recovery prevents reregistration

2016-06-17 Thread Greg Mann (JIRA)
Greg Mann created MESOS-5635:


 Summary: Agent failure during recovery prevents reregistration
 Key: MESOS-5635
 URL: https://issues.apache.org/jira/browse/MESOS-5635
 Project: Mesos
  Issue Type: Bug
Reporter: Greg Mann


This issue was observed recently on an internal test cluster. Due to a bug in 
the agent code (MESOS-5629), regular segfaults were occurring on an agent. 
While the agent was recovering from one of these failures, it segfaulted again. 
After this time, we noticed that after recovery, the agent did not print 
{{Finished recovery}}, and its logs did not show any indication of 
reregistering with the master. Looking at the master's logs, however, the 
following line was observed repeatedly, at intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 
(10.10.0.87) attempted to re-register after removal; shutting it down
{code}
These re-registration attempts had no corresponding lines in the agent log.

Subsequently deleting the contents of the agent's {{work_dir}} and restarting 
it led to a successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at 
slave(1)@10.10.0.87:5051 (10.10.0.87) with id 
2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2016-06-17 Thread Mallik Singaraju (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336993#comment-15336993
 ] 

Mallik Singaraju commented on MESOS-4087:
-

Ok, thanks Joseph.

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
> Fix For: 0.27.0
>
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5630) Change build to always enable Nvidia GPU support for Linux

2016-06-17 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336983#comment-15336983
 ] 

Benjamin Mahler commented on MESOS-5630:


{noformat}
commit da610431162e738615a59cb04fb69766b9a847d5
Author: Kevin Klues 
Date:   Fri Jun 17 14:17:07 2016 -0700

Fixed Cmake build for Nvidia GPU support on Linux.

Review: https://reviews.apache.org/r/48881/
{noformat}

{noformat}
commit 1f65937ba38eca54247447ceafd6ccdd93163cdc
Author: Kevin Klues 
Date:   Fri Jun 17 14:17:15 2016 -0700

Fixed Cmake build for Nvidia GPU support on Linux in stout.

Review: https://reviews.apache.org/r/48882/
{noformat}

{noformat}
commit d2d5c409f51f689f523137b502f553225d3474ae
Author: Kevin Klues 
Date:   Fri Jun 17 14:17:20 2016 -0700

Fixed Cmake build for Nvidia GPU support on Linux in libprocess.

Review: https://reviews.apache.org/r/48883/
{noformat}

> Change build to always enable Nvidia GPU support for Linux
> --
>
> Key: MESOS-5630
> URL: https://issues.apache.org/jira/browse/MESOS-5630
> Project: Mesos
>  Issue Type: Improvement
> Environment: Build / run unit tests in three build environments:
> {noformat}
> 1) CentOS 7 on GPU capable machine
> 2) CentOS 7 on NON-GPU capable machine
> 3) OSX
> $ rm -rf build; ./bootstrap; mkdir build; cd build; ../configure; make -j 
> check; sudo GTEST_FILTER="*NVIDIA*" src/mesos-tests
> {noformat}
> Test support/build_docker.sh (to make sure we won't crash Apache's CI):
> {noformat}
> $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent 
> --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=centos:7 
> support/docker_build.sh
> $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent 
> --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=ubuntu:14.04 
> support/docker_build.sh
> {noformat}
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.0.0
>
>
> See Summary



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5516) Implement GET_STATE Call in v1 agent API.

2016-06-17 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone reassigned MESOS-5516:
-

Assignee: (was: Vinod Kone)

> Implement GET_STATE Call in v1 agent API.
> -
>
> Key: MESOS-5516
> URL: https://issues.apache.org/jira/browse/MESOS-5516
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5592) Pass NetworkInfo to CNI Plugins

2016-06-17 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5592:
--
Assignee: Dan Osborne

> Pass NetworkInfo to CNI Plugins
> ---
>
> Key: MESOS-5592
> URL: https://issues.apache.org/jira/browse/MESOS-5592
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Dan Osborne
>Assignee: Dan Osborne
> Fix For: 1.0.0
>
>
> Mesos has adopted the Container Network Interface as a simple means of 
> networking Mesos tasks launched by the Unified Containerizer. The CNI 
> specification covers a minimum feature set, granting the flexibility to add 
> customized networking functionality in the form of agreements made between 
> the orchestrator and CNI plugin.
> This proposal is to pass NetworkInfo.Labels to the CNI plugin by injecting it 
> into the CNI network configuration json during plugin invocation.
> Design Doc on this change: 
> https://docs.google.com/document/d/1rxruCCcJqpppsQxQrzTbHFVnnW6CgQ2oTieYAmwL284/edit?usp=sharing
> reviewboard: https://reviews.apache.org/r/48527/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5634) Add Framework Capability for GPU_RESOURCES

2016-06-17 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-5634:
--

 Summary: Add Framework Capability for GPU_RESOURCES
 Key: MESOS-5634
 URL: https://issues.apache.org/jira/browse/MESOS-5634
 Project: Mesos
  Issue Type: Task
Reporter: Kevin Klues
Assignee: Kevin Klues
 Fix For: 1.0.0


Due to the scarce resource problem described in MESOS-5377, we plan to 
introduce a GPU_RESOURCES framework capability. This capability will allow the 
Mesos allocator to make better decisions about which frameworks should receive 
resources from GPU-capable machines. In essence, the allocator will ONLY 
allocate resources from GPU-capable machines to frameworks that have this 
capability. This is necessary to prevent non-GPU workloads from filling up the 
GPU machines and preventing GPU workloads from running.
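
A minimal sketch of how a framework might opt in, assuming the capability is 
added to {{FrameworkInfo::Capability}} as planned (the enum name and 
registration flow shown here are assumptions, not a final API):
{code}
#include <mesos/mesos.hpp>

// Sketch only: a framework declaring the proposed GPU_RESOURCES capability
// before registering with the master.
mesos::FrameworkInfo createGpuFrameworkInfo()
{
  mesos::FrameworkInfo framework;
  framework.set_user("");  // Let Mesos fill in the current user.
  framework.set_name("gpu-aware-framework");

  // Without this capability, the allocator would not offer resources
  // from GPU-capable agents to this framework.
  framework.add_capabilities()->set_type(
      mesos::FrameworkInfo::Capability::GPU_RESOURCES);

  return framework;
}
{code}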



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2016-06-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336709#comment-15336709
 ] 

Joseph Wu commented on MESOS-4087:
--

Sounds like you're trying to build a custom solution for your specific 
framework.  You might want to ask in the Spark community on how they've done 
logging.

The {{ContainerLogger}} (this JIRA) is meant to encompass the stdout/stderr of 
*any* executor, and involves loading a module into your agents.  If you are 
willing to dip into C++, you can write your own appender/forwarder.  Examples:
https://github.com/apache/mesos/tree/master/src/slave/container_loggers
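
For illustration only, a standalone forwarder along those lines could be as 
simple as the sketch below. It does NOT implement the actual 
{{ContainerLogger}} module interface (see the container_loggers directory 
linked above for real module examples); it just shows the tee-style idea of 
keeping the sandbox file while forwarding a copy of every line to another sink.
{code}
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv)
{
  // Append to the usual sandbox log file (the path is a placeholder argument).
  const std::string sandboxFile = argc > 1 ? argv[1] : "stdout";
  std::ofstream out(sandboxFile, std::ios::app);

  std::string line;
  while (std::getline(std::cin, line)) {
    out << line << '\n';        // Keep the normal sandbox log.
    out.flush();
    std::cerr << line << '\n';  // Forward a copy to another sink.
  }

  return 0;
}
{code}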

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
> Fix For: 0.27.0
>
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5400) Add preliminary support for parsing ELF files in stout.

2016-06-17 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336700#comment-15336700
 ] 

Kevin Klues commented on MESOS-5400:


{noformat}
commit 7c0f57ff0ecb2b0e3e2cfe5eeca80e53d791c2d3
Author: Kevin Klues klue...@gmail.com
Date:   Fri Jun 17 01:32:29 2016 -0400

Added missing `stout/elf.hpp` file to `nobase_include_HEADERS`.

Without this, files that #included `stout/elf.hpp` would fail with a
`make distcheck` because this file was not being installed properly
from a `make install`.

Review: https://reviews.apache.org/r/48838/
{noformat}

> Add preliminary support for parsing ELF files in stout.
> ---
>
> Key: MESOS-5400
> URL: https://issues.apache.org/jira/browse/MESOS-5400
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>Priority: Minor
> Fix For: 1.0.0
>
>
> The upcoming Nvidia GPU support for docker containers in Mesos relies on 
> consolidating all Nvidia shared libraries into a common location for 
> injecting a volume into a container.
> As part of this, we need some preliminary parsing capabilities for ELF file 
> to infer things about each shared library we are consolidating.
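
As a rough illustration of the kind of preliminary ELF check described in this 
issue (not the actual {{stout/elf.hpp}} interface), reading just the 
identification bytes is enough to tell whether a shared library is a 64-bit 
ELF object:
{code}
#include <elf.h>

#include <cstring>
#include <fstream>
#include <iostream>
#include <string>

// Returns true if 'path' starts with an ELF header whose class is 64-bit.
bool isElf64(const std::string& path)
{
  std::ifstream file(path, std::ios::binary);

  unsigned char ident[EI_NIDENT] = {0};
  file.read(reinterpret_cast<char*>(ident), EI_NIDENT);

  if (!file || std::memcmp(ident, ELFMAG, SELFMAG) != 0) {
    return false;  // Not an ELF file at all.
  }

  return ident[EI_CLASS] == ELFCLASS64;
}

int main()
{
  // The path is only an example; pick any shared library on the host.
  std::cout << std::boolalpha
            << isElf64("/lib/x86_64-linux-gnu/libc.so.6") << std::endl;
  return 0;
}
{code}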



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5630) Change build to always enable Nvidia GPU support for Linux

2016-06-17 Thread Kevin Klues (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336672#comment-15336672
 ] 

Kevin Klues commented on MESOS-5630:


https://reviews.apache.org/r/48832/

> Change build to always enable Nvidia GPU support for Linux
> --
>
> Key: MESOS-5630
> URL: https://issues.apache.org/jira/browse/MESOS-5630
> Project: Mesos
>  Issue Type: Improvement
> Environment: Build / run unit tests in three build environments:
> {noformat}
> 1) CentOS 7 on GPU capable machine
> 2) CentOS 7 on NON-GPU capable machine
> 3) OSX
> $ rm -rf build; ./bootstrap; mkdir build; cd build; ../configure; make -j 
> check; sudo GTEST_FILTER="*NVIDIA*" src/mesos-tests
> {noformat}
> Test support/build_docker.sh (to make sure we won't crash Apache's CI):
> {noformat}
> $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent 
> --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=centos:7 
> support/docker_build.sh
> $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent 
> --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=ubuntu:14.04 
> support/docker_build.sh
> {noformat}
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
> Fix For: 1.0.0
>
>
> See Summary



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4248) mesos slave can't start in CentOS-7 docker container

2016-06-17 Thread Justin Venus (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336624#comment-15336624
 ] 

Justin Venus commented on MESOS-4248:
-

Yes, that is exactly what I want. Thank you for pointing out the ticket so I 
didn't have to go Jira spelunking today.

> mesos slave can't start in CentOS-7 docker container
> 
>
> Key: MESOS-4248
> URL: https://issues.apache.org/jira/browse/MESOS-4248
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.26.0
> Environment: My host OS is Debian Jessie,  the container OS is CentOS 
> 7.2.
> {code}
> # cat /etc/system-release
> CentOS Linux release 7.2.1511 (Core) 
> # rpm -qa |grep mesos
> mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64
> mesosphere-el-repo-7-1.noarch
> mesos-0.26.0-0.2.145.centos701406.x86_64
> $ docker version
> Client:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> Server:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> {code}
>Reporter: Yubao Liu
>
> // Check the "Environment" label above for kinds of software versions.
> "systemctl start mesos-slave" can't start mesos-slave:
> {code}
> # journalctl -u mesos-slave
> 
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Started Mesos Slave.
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Starting Mesos Slave...
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210180 12838 
> logging.cpp:172] INFO level logging started!
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210603 12838 
> main.cpp:190] Build: 2015-12-16 23:06:16 by root
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210625 12838 
> main.cpp:192] Version: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210634 12838 
> main.cpp:195] Git tag: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210644 12838 
> main.cpp:199] Git SHA: d3717e5c4d1bf4fca5c41cd7ea54fae489028faa
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210765 12838 
> containerizer.cpp:142] Using isolation: posix/cpu,posix/mem,filesystem/posix
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.215638 12838 
> linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.220279 12838 
> systemd.cpp:128] systemd version `219` detected
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.227017 12838 
> systemd.cpp:210] Started systemd slice `mesos_executors.slice`
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: Failed to create a 
> containerizer: Could not create MesosContainerizer: Failed to create 
> launcher: Failed to locate systemd cgroups hierarchy: does not exist
> Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Unit mesos-slave.service entered 
> failed state.
> Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service failed.
> {code}
> I used strace to debug it, mesos-slave tried to access 
> "/sys/fs/cgroup/systemd/mesos_executors.slice",  but it's actually at 
> "/sys/fs/cgroup/systemd/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope/mesos_executors.slice/",
>mesos-slave should check "/proc/self/cgroup" to find those intermediate 
> directories:
> {code}
> # cat /proc/self/cgroup 
> 8:perf_event:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 7:blkio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 6:net_cls,net_prio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 5:freezer:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 4:devices:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 3:cpu,cpuacct:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 2:cpuset:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 1:name=systemd:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> {code}
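
The reporter's suggestion above - checking {{/proc/self/cgroup}} for the 
intermediate scope directories - could look roughly like the following sketch 
(illustrative only, not the actual Mesos cgroups code):
{code}
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>

// Returns the path of the named systemd hierarchy for the current process,
// e.g. "/system.slice/docker-<id>.scope" when running inside a container.
std::string systemdCgroupPath()
{
  std::ifstream cgroups("/proc/self/cgroup");

  std::string line;
  while (std::getline(cgroups, line)) {
    // Each line looks like "1:name=systemd:/system.slice/docker-<id>.scope".
    std::istringstream stream(line);
    std::string id, controller, path;
    std::getline(stream, id, ':');
    std::getline(stream, controller, ':');
    std::getline(stream, path);

    if (controller == "name=systemd") {
      return path;
    }
  }

  return "/";  // Fall back to the hierarchy root.
}

int main()
{
  std::cout << "systemd cgroup: " << systemdCgroupPath() << std::endl;
  return 0;
}
{code}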



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5504) Implement GET_MAINTENANCE_SCHEDULE Call in v1 master API.

2016-06-17 Thread Vinod Kone (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336574#comment-15336574
 ] 

Vinod Kone edited comment on MESOS-5504 at 6/17/16 5:59 PM:


commit 87c079e979b8fbf04c4ed491f843f92266f3d7da
Author: haosdent huang 
Date:   Fri Jun 17 10:40:45 2016 -0700

Added test case `MasterAPITest.UpdateAndGetMaintenanceSchedule`.

Review: https://reviews.apache.org/r/48259/

commit b73f0a50f6c5c2feb642827e3e6fbe0ec1a1c914
Author: haosdent huang 
Date:   Fri Jun 17 10:40:39 2016 -0700

Implemented GET_MAINTENANCE_STATUS Call in v1 master API.

Review: https://reviews.apache.org/r/48084/

commit 5f09adb9aa7b49e4d83104f36a14df1385c6880a
Author: haosdent huang 
Date:   Fri Jun 17 10:40:33 2016 -0700

Implemented GET_MAINTENANCE_SCHEDULE Call in v1 master API.

Review: https://reviews.apache.org/r/48257/



was (Author: vinodkone):
commit 87c079e979b8fbf04c4ed491f843f92266f3d7da
Author: haosdent huang 
Date:   Fri Jun 17 10:40:45 2016 -0700

Added test case `MasterAPITest.UpdateAndGetMaintenanceSchedule`.

Review: https://reviews.apache.org/r/48259/

commit b73f0a50f6c5c2feb642827e3e6fbe0ec1a1c914
Author: haosdent huang 
Date:   Fri Jun 17 10:40:39 2016 -0700

Implemented GET_MAINTENANCE_STATUS Call in v1 master API.

Review: https://reviews.apache.org/r/48084/


> Implement GET_MAINTENANCE_SCHEDULE Call in v1 master API.
> -
>
> Key: MESOS-5504
> URL: https://issues.apache.org/jira/browse/MESOS-5504
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: haosdent
> Fix For: 1.0.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2016-06-17 Thread Mallik Singaraju (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336558#comment-15336558
 ] 

Mallik Singaraju commented on MESOS-4087:
-

I am looking at the stdout/stderr of the agent sandbox running the spark 
executor tasks on mesos.

Here is how I am submitting my job from a jenkins slave which has spark submit 
on it.

{code}
SPARK_JAVA_OPTS="\
-Dspark.executor.uri=https://s3.amazonaws.com//spark-1.6.1-bin-hadoop-2.6_scala-2.11.tgz \
-Dlog4j.configuration=log4j.properties \
" \
$SPARK_HOME/bin/spark-submit \
--class com.uptake.ad.AnomalyDetectionApp \
--deploy-mode cluster \
--verbose \
--conf spark.master=mesos://xx.xx.xx.xx:7070 \
--conf spark.ssl.enabled=true \
--conf spark.mesos.coarse=false \
--conf spark.cores.max=1 \
--conf spark.executor.memory=1G \
--conf spark.driver.memory=1G \
https://s3.amazonaws.com//.jar
{code}

I want to override the log4j config, which defaults to the one in 
SPARK_HOME/conf, with the one from the classpath in .jar when the spark 
executor task is run. The goal is to add a graylog appender to log4j so that I 
can push the driver's as well as the executors' application-specific logs to a 
central graylog server.

Looks like when an executor task runs on mesos, spark always loads the 
log4j.properties from SPARK_HOME/conf instead of from .jar.


> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
> Fix For: 0.27.0
>
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5633) User related shell environment is not set correctly in tasks

2016-06-17 Thread haosdent (JIRA)
haosdent created MESOS-5633:
---

 Summary: User related shell environment is not set correctly in 
tasks
 Key: MESOS-5633
 URL: https://issues.apache.org/jira/browse/MESOS-5633
 Project: Mesos
  Issue Type: Bug
Reporter: haosdent


If the user specifies the user field in {{FrameworkInfo}} or {{Task}}, both 
{{setuid}} and {{setgroups}} are applied correctly. However, some user-related 
shell environment variables, e.g., {{HOME}} and {{USER}}, still refer to root.
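
A minimal sketch of one possible direction (the helper below is hypothetical, 
not existing Mesos code): wherever the launcher already switches to the task 
user, also look up that user's passwd entry and export the matching variables.
{code}
#include <pwd.h>

#include <cstdlib>
#include <string>

// Populates HOME, USER and LOGNAME for the given user, overwriting whatever
// the parent process (typically root) had exported.
bool setUserEnvironment(const std::string& user)
{
  struct passwd* pw = ::getpwnam(user.c_str());
  if (pw == nullptr) {
    return false;  // Unknown user; leave the environment untouched.
  }

  ::setenv("HOME", pw->pw_dir, 1);
  ::setenv("USER", pw->pw_name, 1);
  ::setenv("LOGNAME", pw->pw_name, 1);

  return true;
}
{code}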



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5617) Mesos website preview incorrect in facebook

2016-06-17 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-5617:

Attachment: facebook_post.png

> Mesos website preview incorrect in facebook
> ---
>
> Key: MESOS-5617
> URL: https://issues.apache.org/jira/browse/MESOS-5617
> Project: Mesos
>  Issue Type: Improvement
>  Components: project website
>Reporter: haosdent
>Assignee: haosdent
>Priority: Minor
> Attachments: facebook_post.png
>
>
> We need to follow 
> https://developers.facebook.com/docs/sharing/best-practices#images to prevent 
> the preview logo of shares related to the Mesos website from being cropped by 
> Facebook.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4248) mesos slave can't start in CentOS-7 docker container

2016-06-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336464#comment-15336464
 ] 

Joseph Wu commented on MESOS-4248:
--

This might be related to what you want: [MESOS-5544].

> mesos slave can't start in CentOS-7 docker container
> 
>
> Key: MESOS-4248
> URL: https://issues.apache.org/jira/browse/MESOS-4248
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.26.0
> Environment: My host OS is Debian Jessie,  the container OS is CentOS 
> 7.2.
> {code}
> # cat /etc/system-release
> CentOS Linux release 7.2.1511 (Core) 
> # rpm -qa |grep mesos
> mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64
> mesosphere-el-repo-7-1.noarch
> mesos-0.26.0-0.2.145.centos701406.x86_64
> $ docker version
> Client:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> Server:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> {code}
>Reporter: Yubao Liu
>
> // Check the "Environment" label above for kinds of software versions.
> "systemctl start mesos-slave" can't start mesos-slave:
> {code}
> # journalctl -u mesos-slave
> 
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Started Mesos Slave.
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Starting Mesos Slave...
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210180 12838 
> logging.cpp:172] INFO level logging started!
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210603 12838 
> main.cpp:190] Build: 2015-12-16 23:06:16 by root
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210625 12838 
> main.cpp:192] Version: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210634 12838 
> main.cpp:195] Git tag: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210644 12838 
> main.cpp:199] Git SHA: d3717e5c4d1bf4fca5c41cd7ea54fae489028faa
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210765 12838 
> containerizer.cpp:142] Using isolation: posix/cpu,posix/mem,filesystem/posix
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.215638 12838 
> linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.220279 12838 
> systemd.cpp:128] systemd version `219` detected
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.227017 12838 
> systemd.cpp:210] Started systemd slice `mesos_executors.slice`
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: Failed to create a 
> containerizer: Could not create MesosContainerizer: Failed to create 
> launcher: Failed to locate systemd cgroups hierarchy: does not exist
> Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service: main process 
> exited, code=exited, status=1/FAILURE
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Unit mesos-slave.service entered 
> failed state.
> Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service failed.
> {code}
> I used strace to debug it, mesos-slave tried to access 
> "/sys/fs/cgroup/systemd/mesos_executors.slice",  but it's actually at 
> "/sys/fs/cgroup/systemd/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope/mesos_executors.slice/",
>mesos-slave should check "/proc/self/cgroup" to find those intermediate 
> directories:
> {code}
> # cat /proc/self/cgroup 
> 8:perf_event:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 7:blkio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 6:net_cls,net_prio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 5:freezer:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 4:devices:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 3:cpu,cpuacct:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 2:cpuset:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> 1:name=systemd:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2016-06-17 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336435#comment-15336435
 ] 

Joseph Wu commented on MESOS-4087:
--

Just to clarify, are you looking at the stdout/stderr of your {{spark-submit}} 
command?  Or are you looking at the [agent 
sandboxes|http://mesos.apache.org/documentation/latest/sandbox/#where-is-it] 
for your spark executors?

Under the default settings, the spark executors' sandboxes will have a 
{{stdout}} and {{stderr}} file for their stdout/stderr logging.  If {{log4j}} 
places logs in a different location, you'll have to check that location.

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
> Fix For: 0.27.0
>
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4248) mesos slave can't start in CentOS-7 docker container

2016-06-17 Thread Justin Venus (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336432#comment-15336432
 ] 

Justin Venus commented on MESOS-4248:
-

Thanks for pointing that ticket out.

However, MESOS-4675 doesn't solve my use case.  
- I want to run systemd in a docker container
- I want mesos-slave to setup the slice "mesos_executors.slice"
- I want to use the cgroup isolators
- I want mesos-executor tasks to survive a mesos-slave restart
- Basically I want mesos-slave to work like it's on bare metal (especially in a 
docker container)

I'm carrying around patches for 0.25.0, 0.26.0 and testing 0.27.2 to make this 
work.  I'll open a feature request in jira.

Please notice systemd is in a CGroup
{code}
[root@mesos-slave05of2 /]# systemctl status
● mesos-slave05of2
State: running
 Jobs: 0 queued
   Failed: 0 units
Since: Wed 2016-06-08 21:41:38 UTC; 1 weeks 1 days ago
   CGroup: 
/system.slice/docker-6c53ffcbc602cc6b19149030f6f453a1febd7fc79bf472fa6227c1fecd7c053c.scope
   ├─1 /usr/lib/systemd/systemd --system --log-target=console 
--log-level=info --unit=mesos-slave.target
   ├─mesos_executors.slice
   │ ├─10139 python2.7 
/var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-001235-380151
   │ ├─10171 /usr/bin/python2.7 
/var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012
   │ ├─10282 /usr/bin/python2.7 
/var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012
   │ ├─10283 /bin/bash -c echo '#!/bin/bash  
PEX_INSTALL=${PEX_INSTALL:-${HOME}/.pex/install} LD_LIBRARY_PATH=${LD_LIB
   │ ├─12647 python2.7 
/var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-001235-380151
   │ ├─12672 /usr/bin/python2.7 
/var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012
   │ ├─12690 /usr/bin/python2.7 
/var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012
   │ └─12691 python2.7 
/var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-001235-380151
   └─system.slice
 ├─thermos-observer.service
 │ └─142 python2.7 /usr/sbin/thermos_observer 
--mesos-root=/var/lib/mesos --port=1338 --log_to_disk=NONE --log_to_
 ├─mesos-slave.service
 │ ├─  143 mesos-slave
 │ ├─  187 mesos-docker-executor
{code}

> mesos slave can't start in CentOS-7 docker container
> 
>
> Key: MESOS-4248
> URL: https://issues.apache.org/jira/browse/MESOS-4248
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.26.0
> Environment: My host OS is Debian Jessie,  the container OS is CentOS 
> 7.2.
> {code}
> # cat /etc/system-release
> CentOS Linux release 7.2.1511 (Core) 
> # rpm -qa |grep mesos
> mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64
> mesosphere-el-repo-7-1.noarch
> mesos-0.26.0-0.2.145.centos701406.x86_64
> $ docker version
> Client:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> Server:
>  Version:  1.9.1
>  API version:  1.21
>  Go version:   go1.4.2
>  Git commit:   a34a1d5
>  Built:Fri Nov 20 12:59:02 UTC 2015
>  OS/Arch:  linux/amd64
> {code}
>Reporter: Yubao Liu
>
> // Check the "Environment" label above for kinds of software versions.
> "systemctl start mesos-slave" can't start mesos-slave:
> {code}
> # journalctl -u mesos-slave
> 
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Started Mesos Slave.
> Dec 24 10:35:25 mesos-slave1 systemd[1]: Starting Mesos Slave...
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210180 12838 
> logging.cpp:172] INFO level logging started!
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210603 12838 
> main.cpp:190] Build: 2015-12-16 23:06:16 by root
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210625 12838 
> main.cpp:192] Version: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210634 12838 
> main.cpp:195] Git tag: 0.26.0
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210644 12838 
> main.cpp:199] Git SHA: d3717e5c4d1bf4fca5c41cd7ea54fae489028faa
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210765 12838 
> containerizer.cpp:142] Using isolation: posix/cpu,posix/mem,filesystem/posix
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.215638 12838 
> linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy 
> for the Linux launcher
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 

[jira] [Commented] (MESOS-5629) Agent segfaults after request to '/files/browse'

2016-06-17 Thread Greg Mann (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336423#comment-15336423
 ] 

Greg Mann commented on MESOS-5629:
--

I just did some testing as well - reliably reproduced the segfault before the 
fix, and was unable to induce it after the fix. LGTM!

> Agent segfaults after request to '/files/browse'
> 
>
> Key: MESOS-5629
> URL: https://issues.apache.org/jira/browse/MESOS-5629
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7, Mesos 1.0.0-rc1 with patches
>Reporter: Greg Mann
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
> Attachments: test-browse.py
>
>
> We observed a number of agent segfaults today on an internal testing cluster. 
> Here is a log excerpt:
> {code}
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 
> status_update_manager.cpp:392] Received status update acknowledgement (UUID: 
> e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task 
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 
> status_update_manager.cpp:824] Checkpointing ACK for status update 
> TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task 
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 
> (unix time) try "date -d @1466097149" if you are using GNU date ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received 
> by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33 
> process::dispatch<>()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7 
> _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752 
> mesos::internal::FilesProcess::authorize()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea 
> mesos::internal::FilesProcess::browse()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43 
> std::_Function_handler<>::_M_invoke()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb 
> _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341 
> process::ProcessManager::resume()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5 
> start_thread
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main 
> process exited, code=killed, status=11/SEGV
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service 
> entered failed state.
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed.
> Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff 
> time over, scheduling restart.
> {code}
> In every case, the stack trace indicates one of the {{/files/*}} endpoints; I 
> observed this a number of times coming from {{browse()}}, and twice from 
> {{read()}}.
> The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied: 
> [this|https://reviews.apache.org/r/48563/] and 
> [this|https://reviews.apache.org/r/48566/], which were done to repair a 
> different [segfault issue|https://issues.apache.org/jira/browse/MESOS-5587] 
> on the master and agent.
> Thanks go to [~bmahler] for digging into this a bit and discovering a 
> possible cause 
> [here|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5737-L5745],
>  where use of {{defer()}} may be necessary 

[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output

2016-06-17 Thread Mallik Singaraju (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336124#comment-15336124
 ] 

Mallik Singaraju commented on MESOS-4087:
-

Hi, we are using spark 1.6.1 deployed and running on mesos, and I need some info 
on how to capture the logs of spark executors running on mesos in spark 1.6.1. 
We are not using the container based approach to deploy spark on mesos. Instead 
we are currently just deploying the spark job (.jar) through spark-submit. 

I am not currently able to override the default behavior of spark executors 
always picking up the log4j.properties in the /conf .
I tried setting log4j.configuration to the log4j.properties in the 
classpath of .jar and did supply that as an argument to 
spark-submit. That does not seem to capture any logs of the spark executor 
tasks in mesos. I did figure out that you worked on the logging piece through 
JIRA. Do you have any recommendation on how to approach this?

> Introduce a module for logging executor/task output
> ---
>
> Key: MESOS-4087
> URL: https://issues.apache.org/jira/browse/MESOS-4087
> Project: Mesos
>  Issue Type: Task
>  Components: containerization, modules
>Reporter: Joseph Wu
>Assignee: Joseph Wu
>  Labels: logging, mesosphere
> Fix For: 0.27.0
>
>
> Existing executor/task logs are logged to files in their sandbox directory, 
> with some nuances based on which containerizer is used (see background 
> section in linked document).
> A logger for executor/task logs has the following requirements:
> * The logger is given a command to run and must handle the stdout/stderr of 
> the command.
> * The handling of stdout/stderr must be resilient across agent failover.  
> Logging should not stop if the agent fails.
> * Logs should be readable, presumably via the web UI, or via some other 
> module-specific UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5629) Agent segfaults after request to '/files/browse'

2016-06-17 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336016#comment-15336016
 ] 

Joerg Schad commented on MESOS-5629:


https://reviews.apache.org/r/48849/

> Agent segfaults after request to '/files/browse'
> 
>
> Key: MESOS-5629
> URL: https://issues.apache.org/jira/browse/MESOS-5629
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7, Mesos 1.0.0-rc1 with patches
>Reporter: Greg Mann
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
> Attachments: test-browse.py
>
>
> We observed a number of agent segfaults today on an internal testing cluster. 
> Here is a log excerpt:
> {code}
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 
> status_update_manager.cpp:392] Received status update acknowledgement (UUID: 
> e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task 
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 
> status_update_manager.cpp:824] Checkpointing ACK for status update 
> TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task 
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 
> (unix time) try "date -d @1466097149" if you are using GNU date ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received 
> by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33 
> process::dispatch<>()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7 
> _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752 
> mesos::internal::FilesProcess::authorize()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea 
> mesos::internal::FilesProcess::browse()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43 
> std::_Function_handler<>::_M_invoke()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb 
> _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341 
> process::ProcessManager::resume()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5 
> start_thread
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main 
> process exited, code=killed, status=11/SEGV
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service 
> entered failed state.
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed.
> Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff 
> time over, scheduling restart.
> {code}
> In every case, the stack trace indicates one of the {{/files/*}} endpoints; I 
> observed this a number of times coming from {{browse()}}, and twice from 
> {{read()}}.
> The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied: 
> [this|https://reviews.apache.org/r/48563/] and 
> [this|https://reviews.apache.org/r/48566/], which were done to repair a 
> different [segfault issue|https://issues.apache.org/jira/browse/MESOS-5587] 
> on the master and agent.
> Thanks go to [~bmahler] for digging into this a bit and discovering a 
> possible cause 
> [here|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5737-L5745],
>  where use of {{defer()}} may be necessary to keep execution in the correct 
> context.



--
This message was sent by Atlassian JIRA

[jira] [Commented] (MESOS-5629) Agent segfaults after request to '/files/browse'

2016-06-17 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335967#comment-15335967
 ] 

Joerg Schad commented on MESOS-5629:


The most likely hypothesis is that the issue is caused by capturing `this` in 
Framework::launchExecutor and Framework::recoverExecutor.
The `Framework` can go out of scope, but the `this` pointer is still kept 
in the lambda and hence dangles.

Our solution is to remove the `this` capture and replace it by a value copy of 
the slave pid (which is the only attribute used from captured `this`).

We have not been able to reproduce this on an AWS instance. [~greggomann] could 
you help out with verifying the patch (see next comment)?
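
To make the failure mode concrete, here is a minimal, self-contained 
illustration of that pattern; the names are simplified stand-ins, not the 
actual slave.cpp code:
{code}
#include <functional>
#include <iostream>
#include <string>

struct Framework
{
  std::string slavePid = "slave(1)@10.10.0.87:5051";

  // Dangerous: keeps a raw `this`. If the Framework is destroyed before the
  // callback runs, the callback dereferences a dangling pointer (UB).
  std::function<std::string()> capturingThis()
  {
    return [this]() { return slavePid; };
  }

  // Shape of the proposed fix: copy the only value that is actually needed.
  std::function<std::string()> capturingValue()
  {
    const std::string pid = slavePid;
    return [pid]() { return pid; };
  }
};

int main()
{
  std::function<std::string()> callback;
  {
    Framework framework;
    callback = framework.capturingValue();  // Survives framework's destruction.
  }
  std::cout << callback() << std::endl;
  return 0;
}
{code}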



> Agent segfaults after request to '/files/browse'
> 
>
> Key: MESOS-5629
> URL: https://issues.apache.org/jira/browse/MESOS-5629
> Project: Mesos
>  Issue Type: Bug
> Environment: CentOS 7, Mesos 1.0.0-rc1 with patches
>Reporter: Greg Mann
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: authorization, mesosphere, security
> Fix For: 1.0.0
>
> Attachments: test-browse.py
>
>
> We observed a number of agent segfaults today on an internal testing cluster. 
> Here is a log excerpt:
> {code}
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 
> status_update_manager.cpp:392] Received status update acknowledgement (UUID: 
> e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task 
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-
> Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 
> status_update_manager.cpp:824] Checkpointing ACK for status update 
> TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task 
> datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework 
> 6d4248cd-2832-4152-b5d0-defbf36f6759-
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 
> http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 
> (unix time) try "date -d @1466097149" if you are using GNU date ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received 
> by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: ***
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33 
> process::dispatch<>()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7 
> _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752 
> mesos::internal::FilesProcess::authorize()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea 
> mesos::internal::FilesProcess::browse()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43 
> std::_Function_handler<>::_M_invoke()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb 
> _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ENKUlRKNS4_IbEEE1_clESG_
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341 
> process::ProcessManager::resume()
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647 
> _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 
> (unknown)
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5 
> start_thread
> Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main 
> process exited, code=killed, status=11/SEGV
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service 
> entered failed state.
> Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed.
> Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff 
> time over, scheduling restart.
> {code}
> In every case, the stack trace indicates one of the {{/files/*}} endpoints; I 
> observed this a number of times coming from {{browse()}}, and twice from 
> {{read()}}.
> The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied: 
> [this|https://reviews.apache.org/r/48563/] and 
> 

[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.

2016-06-17 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335936#comment-15335936
 ] 

Till Toenshoff commented on MESOS-5588:
---

The object count comparison looks like a great start - I like it.
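
To sketch what that comparison could look like (using simplified stand-in 
types rather than the real stout JSON/protobuf interfaces): compare the keys 
present in each JSON ACL entry against the field names the message actually 
recognizes, and refuse to start on anything unrecognized.
{code}
#include <iostream>
#include <map>
#include <set>
#include <string>

// Returns false if the parsed JSON object contains a key that the target
// message does not recognize (e.g. "usr" instead of "user").
bool validateAclKeys(
    const std::map<std::string, std::string>& jsonEntry,
    const std::set<std::string>& knownFields)
{
  for (const auto& field : jsonEntry) {
    if (knownFields.count(field.first) == 0) {
      std::cerr << "Unrecognized ACL key '" << field.first << "'" << std::endl;
      return false;
    }
  }
  return true;
}

int main()
{
  // The typo discussed in this issue: "usr" instead of "user".
  std::map<std::string, std::string> acl = {
      {"principals", "ANY"}, {"usr", "NONE"}};

  if (!validateAclKeys(acl, {"principals", "user"})) {
    std::cerr << "Refusing to start with a malformed ACL" << std::endl;
    return 1;
  }
  return 0;
}
{code}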

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> During parsing of the authorizer configuration, errors are ignored. This can 
> lead to undetected security issues.
> Consider the following acl with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags, it will interpret the acl in the 
> following way, which gives any principal access to any framework.
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5588) Improve error handling when parsing acls.

2016-06-17 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5588:
--
Priority: Major  (was: Blocker)

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> During parsing of the authorizer configuration, errors are ignored. This can 
> lead to undetected security issues.
> Consider the following acl with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags, it will interpret the acl in the 
> following way, which gives any principal access to any framework.
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.

2016-06-17 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335934#comment-15335934
 ] 

Till Toenshoff commented on MESOS-5588:
---

This patch de-escalates the issue from the original blocker, as further 
changes will not change the API (i.e., the proto).

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> During parsing of the authorizer configuration, errors are ignored. This can 
> lead to undetected security issues.
> Consider the following acl with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags, it will interpret the acl in the 
> following way, which gives any principal access to any framework.
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.

2016-06-17 Thread Till Toenshoff (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335933#comment-15335933
 ] 

Till Toenshoff commented on MESOS-5588:
---

{noformat}
commit a1a9108338b37f2aea0a575dfc7cbca5b8489cc1
Author: Alexander Rojas 
Date:   Fri Jun 17 13:02:38 2016 +0200

Marked some optional fields in acls.proto as required.

The messages `GetEndpoints`, `ViewFramework`, `ViewTask`, `ViewExecutor`
and `AccessSandbox` all have optional authorization objects as a result
of copy and pasting previous message, but their semantics were those
of an required field, which led to some unexpected behavior when a user
misstyped any entry there.

This patch sets the fields to their actual expected values.

Review: https://reviews.apache.org/r/48781/
{noformat}

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> During parsing of the authorizer configuration, errors are ignored. This can 
> lead to undetected security issues.
> Consider the following acl with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags, it will interpret the acl in the 
> following way, which gives any principal access to any framework.
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5632) Orphaned docker container not killed if executor has exited

2016-06-17 Thread Mansheng Yang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335734#comment-15335734
 ] 

Mansheng Yang commented on MESOS-5632:
--

Yes, restarting the agent will kill the two containers and start a new one.

> Orphaned docker container not killed if executor has exited
> ---
>
> Key: MESOS-5632
> URL: https://issues.apache.org/jira/browse/MESOS-5632
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Reporter: Mansheng Yang
>
> [This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as 
> resolved, but it was only partially fixed.
> As mentioned in that ticket, if you start a docker container and kill the 
> docker-executor process, a new container will be started but the old one 
> will still be there.
> Some logs:
> {noformat}
> I0617 15:01:22.851604  7285 docker.cpp:877] Recovering container 
> '71695f70-afad-421d-8636-deb6724ecaca' for executor 
> 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> '317ab6ce-d599-4ad4-bae2-eb74a6c42d87-'
> I0617 15:01:22.853303  7285 docker.cpp:2107] Executor for container 
> '71695f70-afad-421d-8636-deb6724ecaca' has exited
> I0617 15:01:22.853327  7285 docker.cpp:1826] Destroying container 
> '71695f70-afad-421d-8636-deb6724ecaca'
> I0617 15:01:22.853575  7285 docker.cpp:1954] Running docker stop on container 
> '71695f70-afad-421d-8636-deb6724ecaca'
> I0617 15:01:22.853607  7285 docker.cpp:1956] Running docker stop on container 
> 'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0
> I0617 15:01:22.854801  7283 slave.cpp:4767] Sending reconnect request to 
> executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304
> E0617 15:01:22.855870  7283 process.cpp:2040] Failed to shutdown socket with 
> fd 10: Transport endpoint is not connected
> E0617 15:01:22.855974  7283 slave.cpp:4118] Termination of executor 
> 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- failed: Unknown container: 
> 71695f70-afad-421d-8636-deb6724ecaca
> I0617 15:01:22.857015  7283 slave.cpp:3257] Handling status update 
> TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- from @0.0.0.0:0
> W0617 15:01:22.858330  7288 docker.cpp:1403] Ignoring updating unknown 
> container: 71695f70-afad-421d-8636-deb6724ecaca
> I0617 15:01:22.858819  7288 status_update_manager.cpp:320] Received status 
> update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87-
> I0617 15:01:22.858986  7288 status_update_manager.cpp:824] Checkpointing 
> UPDATE for status update TASK_FAILED (UUID: 
> b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87-
> W0617 15:01:22.920336  7289 slave.cpp:3601] Dropping status update 
> TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- sent by status update manager 
> because the agent is in RECOVERING state
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5632) Orphaned docker container not killed if executor has exited

2016-06-17 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335650#comment-15335650
 ] 

haosdent commented on MESOS-5632:
-

Does restarting the Mesos Agent work for you?

> Orphaned docker container not killed if executor has exited
> ---
>
> Key: MESOS-5632
> URL: https://issues.apache.org/jira/browse/MESOS-5632
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Reporter: Mansheng Yang
>
> [This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as 
> resolved, but it was only partially fixed.
> As mentioned in that ticket, if you start a docker container and kill the 
> docker-executor process, a new container will be started but the old one 
> will still be there.
> Some logs:
> {noformat}
> I0617 15:01:22.851604  7285 docker.cpp:877] Recovering container 
> '71695f70-afad-421d-8636-deb6724ecaca' for executor 
> 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> '317ab6ce-d599-4ad4-bae2-eb74a6c42d87-'
> I0617 15:01:22.853303  7285 docker.cpp:2107] Executor for container 
> '71695f70-afad-421d-8636-deb6724ecaca' has exited
> I0617 15:01:22.853327  7285 docker.cpp:1826] Destroying container 
> '71695f70-afad-421d-8636-deb6724ecaca'
> I0617 15:01:22.853575  7285 docker.cpp:1954] Running docker stop on container 
> '71695f70-afad-421d-8636-deb6724ecaca'
> I0617 15:01:22.853607  7285 docker.cpp:1956] Running docker stop on container 
> 'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0
> I0617 15:01:22.854801  7283 slave.cpp:4767] Sending reconnect request to 
> executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304
> E0617 15:01:22.855870  7283 process.cpp:2040] Failed to shutdown socket with 
> fd 10: Transport endpoint is not connected
> E0617 15:01:22.855974  7283 slave.cpp:4118] Termination of executor 
> 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- failed: Unknown container: 
> 71695f70-afad-421d-8636-deb6724ecaca
> I0617 15:01:22.857015  7283 slave.cpp:3257] Handling status update 
> TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- from @0.0.0.0:0
> W0617 15:01:22.858330  7288 docker.cpp:1403] Ignoring updating unknown 
> container: 71695f70-afad-421d-8636-deb6724ecaca
> I0617 15:01:22.858819  7288 status_update_manager.cpp:320] Received status 
> update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87-
> I0617 15:01:22.858986  7288 status_update_manager.cpp:824] Checkpointing 
> UPDATE for status update TASK_FAILED (UUID: 
> b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87-
> W0617 15:01:22.920336  7289 slave.cpp:3601] Dropping status update 
> TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
> kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- sent by status update manager 
> because the agent is in RECOVERING state
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5632) Orphaned docker container not killed if executor has exited

2016-06-17 Thread Mansheng Yang (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Mansheng Yang updated MESOS-5632:
-
Description: 
[This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as 
resolved, but it was only partially fixed.

As mentioned in that ticket, if you start a docker container and kill the 
docker-executor process, a new container will be started but the old one 
will still be there.

Some logs:
{noformat}
I0617 15:01:22.851604  7285 docker.cpp:877] Recovering container 
'71695f70-afad-421d-8636-deb6724ecaca' for executor 
'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
'317ab6ce-d599-4ad4-bae2-eb74a6c42d87-'
I0617 15:01:22.853303  7285 docker.cpp:2107] Executor for container 
'71695f70-afad-421d-8636-deb6724ecaca' has exited
I0617 15:01:22.853327  7285 docker.cpp:1826] Destroying container 
'71695f70-afad-421d-8636-deb6724ecaca'
I0617 15:01:22.853575  7285 docker.cpp:1954] Running docker stop on container 
'71695f70-afad-421d-8636-deb6724ecaca'
I0617 15:01:22.853607  7285 docker.cpp:1956] Running docker stop on container 
'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0
I0617 15:01:22.854801  7283 slave.cpp:4767] Sending reconnect request to 
executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304
E0617 15:01:22.855870  7283 process.cpp:2040] Failed to shutdown socket with fd 
10: Transport endpoint is not connected
E0617 15:01:22.855974  7283 slave.cpp:4118] Termination of executor 
'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
317ab6ce-d599-4ad4-bae2-eb74a6c42d87- failed: Unknown container: 
71695f70-afad-421d-8636-deb6724ecaca
I0617 15:01:22.857015  7283 slave.cpp:3257] Handling status update TASK_FAILED 
(UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
317ab6ce-d599-4ad4-bae2-eb74a6c42d87- from @0.0.0.0:0
W0617 15:01:22.858330  7288 docker.cpp:1403] Ignoring updating unknown 
container: 71695f70-afad-421d-8636-deb6724ecaca
I0617 15:01:22.858819  7288 status_update_manager.cpp:320] Received status 
update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
317ab6ce-d599-4ad4-bae2-eb74a6c42d87-
I0617 15:01:22.858986  7288 status_update_manager.cpp:824] Checkpointing UPDATE 
for status update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for 
task kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
317ab6ce-d599-4ad4-bae2-eb74a6c42d87-
W0617 15:01:22.920336  7289 slave.cpp:3601] Dropping status update TASK_FAILED 
(UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task 
kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 
317ab6ce-d599-4ad4-bae2-eb74a6c42d87- sent by status update manager because 
the agent is in RECOVERING state
{noformat}

> Orphaned docker container not killed if executor has exited
> ---
>
> Key: MESOS-5632
> URL: https://issues.apache.org/jira/browse/MESOS-5632
> Project: Mesos
>  Issue Type: Bug
>  Components: docker, slave
>Reporter: Mansheng Yang
>
> [This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as 
> resolved, but it was only partially fixed.
> As mentioned in that ticket, if you start a docker container and kill the 
> docker-executor process, a new container will be started but the old one 
> will still be there.
> Some logs:
> {noformat}
> I0617 15:01:22.851604  7285 docker.cpp:877] Recovering container 
> '71695f70-afad-421d-8636-deb6724ecaca' for executor 
> 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> '317ab6ce-d599-4ad4-bae2-eb74a6c42d87-'
> I0617 15:01:22.853303  7285 docker.cpp:2107] Executor for container 
> '71695f70-afad-421d-8636-deb6724ecaca' has exited
> I0617 15:01:22.853327  7285 docker.cpp:1826] Destroying container 
> '71695f70-afad-421d-8636-deb6724ecaca'
> I0617 15:01:22.853575  7285 docker.cpp:1954] Running docker stop on container 
> '71695f70-afad-421d-8636-deb6724ecaca'
> I0617 15:01:22.853607  7285 docker.cpp:1956] Running docker stop on container 
> 'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0
> I0617 15:01:22.854801  7283 slave.cpp:4767] Sending reconnect request to 
> executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304
> E0617 15:01:22.855870  7283 process.cpp:2040] Failed to shutdown socket with 
> fd 10: Transport endpoint is not connected
> E0617 15:01:22.855974  7283 slave.cpp:4118] Termination of executor 
> 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 
> 

[jira] [Created] (MESOS-5632) Orphaned docker container not killed if executor has exited

2016-06-17 Thread Mansheng Yang (JIRA)
Mansheng Yang created MESOS-5632:


 Summary: Orphaned docker container not killed if executor has 
exited
 Key: MESOS-5632
 URL: https://issues.apache.org/jira/browse/MESOS-5632
 Project: Mesos
  Issue Type: Bug
  Components: docker, slave
Reporter: Mansheng Yang






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5631) Implement clang-tidy check for incorrect use of capturing lambdas with Futures

2016-06-17 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier updated MESOS-5631:

Description: 
When one enqueues capturing lambdas to a {{Future}} with {{then}} or the 
{{onXXX}} variations, in general any actor might execute that callback (no 
constraints imposed per se).

This can lead to hard-to-understand dependencies or bugs if the lambda needs to 
access external state (i.e. anything it captures by reference/pointer instead of 
by value); instead, such callbacks should always be constrained to a specific 
actor with {{dispatch}}/{{defer}} to ensure the pointed-to data isn't modified 
in a concurrent thread.

  was:
When one enqueues capturing lambdas to a {{Future}} with {{then}} or then 
{{onXXX}} variations, in general any actor might execute that callback (no 
constraints imposed per se).

This can lead to hard to understand dependencies or bugs if the lambda needs to 
access external state (i.e. anything it captures by references/pointer to 
instead of by value); instead such callbacks should always be constraint to a 
specific actor with {{dispatch}}/{{defer}} to ensure the pointed to data isn't 
modified in a concurrent thread.


> Implement clang-tidy check for incorrect use of capturing lambdas with Futures
> --
>
> Key: MESOS-5631
> URL: https://issues.apache.org/jira/browse/MESOS-5631
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>
> When one enqueues capturing lambdas to a {{Future}} with {{then}} or the 
> {{onXXX}} variations, in general any actor might execute that callback (no 
> constraints imposed per se).
> This can lead to hard-to-understand dependencies or bugs if the lambda needs 
> to access external state (i.e. anything it captures by reference/pointer 
> instead of by value); instead, such callbacks should always be constrained to 
> a specific actor with {{dispatch}}/{{defer}} to ensure the pointed-to data 
> isn't modified in a concurrent thread.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5631) Implement clang-tidy check for incorrect use of capturing lambdas with Futures

2016-06-17 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-5631:
---

 Summary: Implement clang-tidy check for incorrect use of capturing 
lambdas with Futures
 Key: MESOS-5631
 URL: https://issues.apache.org/jira/browse/MESOS-5631
 Project: Mesos
  Issue Type: Improvement
Reporter: Benjamin Bannier


When one enqueues capturing lambdas to a {{Future}} with {{then}} or the 
{{onXXX}} variations, in general any actor might execute that callback (no 
constraints imposed per se).

This can lead to hard-to-understand dependencies or bugs if the lambda needs to 
access external state (i.e. anything it captures by reference/pointer instead of 
by value); instead, such callbacks should always be constrained to a specific 
actor with {{dispatch}}/{{defer}} to ensure the pointed-to data isn't modified 
in a concurrent thread.
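
As an aside, here is a minimal sketch of the pattern in question (illustrative 
code, not taken from the Mesos tree; it assumes the libprocess headers and a 
made-up {{CounterProcess}} class): the commented-out continuation may run on 
whichever actor satisfies the future and can race on the captured state, while 
the {{defer}}-ed variant is serialized with the rest of the actor's dispatches.

{code}
// Illustrative sketch only -- assumes libprocess headers from the Mesos tree.
#include <process/defer.hpp>
#include <process/future.hpp>
#include <process/process.hpp>

using process::Future;
using process::Process;
using process::defer;

class CounterProcess : public Process<CounterProcess>
{
public:
  Future<int> add(Future<int> f)
  {
    // Risky: this lambda captures `this` and may be executed by whatever
    // actor completes `f`, so it can touch `count` concurrently with other
    // methods of this process:
    //
    //   return f.then([this](int v) { return count += v; });

    // Preferred: constrain the continuation to this actor with defer(), so
    // it runs serialized with every other dispatch to CounterProcess.
    return f.then(defer(self(), [this](int v) {
      return count += v;
    }));
  }

private:
  int count = 0;
};
{code}

A clang-tidy check along these lines would flag {{then}}/{{onXXX}} arguments 
that capture state by reference or pointer without being wrapped in 
{{defer}}/{{dispatch}}.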



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4967) Oversubscription for reservation

2016-06-17 Thread Klaus Ma (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Klaus Ma reassigned MESOS-4967:
---

Assignee: Klaus Ma

> Oversubscription for reservation
> 
>
> Key: MESOS-4967
> URL: https://issues.apache.org/jira/browse/MESOS-4967
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation, framework, master
>Reporter: Klaus Ma
>Assignee: Klaus Ma
>  Labels: IBM, mesosphere
>
> Reserved resources allow frameworks and cluster operators to ensure 
> sufficient resources are available when needed.  Reservations are usually 
> made to guarantee there are enough resources under peak loads. Oftentimes, 
> reserved resources are not actually allocated; in other words, the frameworks 
> do not use those resources and they sit reserved, but idle.
> This underutilization is either an opportunity cost or a direct cost, 
> particularly to the cluster operator.  Reserved but unallocated resources 
> held by a Lender Framework could be optimistically offered to other 
> frameworks, which we refer to as Tenant Frameworks.  When the resources are 
> requested back by the Lender Framework, some of the Tenant Framework’s tasks 
> are evicted and the original resource offer guarantee is preserved.
> The first step is to identify when resources are reserved, but not allocated. 
>  We then offer these reserved resources to other frameworks, but mark these 
> offered resources as revocable resources.  This allows Tenant Frameworks to 
> use these resources temporarily in a 'best-effort' fashion, knowing that they 
> could be revoked or reclaimed at any time.
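
For illustration only, a back-of-the-envelope sketch of those first two steps 
(plain C++ with made-up {{Resources}} and {{Offer}} types, not the actual Mesos 
allocator structures): compute the reserved-but-unallocated slack held by the 
Lender Framework and re-offer it marked as revocable.

{code}
// Hypothetical sketch -- types and fields are illustrative, not Mesos'.
#include <algorithm>
#include <iostream>

struct Resources
{
  double cpus = 0.0;
  double mem = 0.0;  // In MB.
};

// Reserved-but-unallocated slack: what the Lender Framework holds but does
// not currently use, and what could be offered to Tenant Frameworks.
Resources slack(const Resources& reserved, const Resources& allocated)
{
  return Resources{
    std::max(0.0, reserved.cpus - allocated.cpus),
    std::max(0.0, reserved.mem - allocated.mem)};
}

struct Offer
{
  Resources resources;
  bool revocable = false;  // Tenant tasks on these may be evicted.
};

int main()
{
  Resources reserved{8.0, 16384.0};   // Reserved for the Lender's role.
  Resources allocated{3.0, 4096.0};   // Actually in use by the Lender.

  // Step 1: identify reserved-but-unallocated resources.
  // Step 2: offer them, marked revocable, so they can be reclaimed later.
  Offer offer{slack(reserved, allocated), true};

  std::cout << "Revocable offer: " << offer.resources.cpus << " cpus, "
            << offer.resources.mem << " MB mem" << std::endl;
  return 0;
}
{code}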



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5491) Implement GET_AGENTS Call in v1 master API.

2016-06-17 Thread zhou xing (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335541#comment-15335541
 ] 

zhou xing commented on MESOS-5491:
--

two RRs: https://reviews.apache.org/r/48841/ & 
https://reviews.apache.org/r/48438/

> Implement GET_AGENTS Call in v1 master API.
> ---
>
> Key: MESOS-5491
> URL: https://issues.apache.org/jira/browse/MESOS-5491
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: zhou xing
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-5588) Improve error handling when parsing acls.

2016-06-17 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335528#comment-15335528
 ] 

Joerg Schad edited comment on MESOS-5588 at 6/17/16 6:37 AM:
-

1) Yes, that happens to every protobuf conversion (see my earlier comments on 
this ticket about changing the parsing). But in this case it yields a 
security-critical issue. IMO this is within the scope of this ticket (improve 
error handling when parsing acls), but if you want to split the 
1.0-blocker-relevant part into an extra ticket, that seems fine with me.

2) I agree that the second part is not a blocker (as it does not involve an API 
change), but I would not say that it is a low-priority wish.


was (Author: js84):
I agree that the second part is not a blocker (as it does not involve an API 
change), but I would not say that it is a low priority wish

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> During parsing of the authorizer, errors are ignored. This can lead to 
> undetected security issues.
> Consider the following acl with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags it will interpret the acl in the 
> following way, which gives any principal access to any framework.
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.

2016-06-17 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335528#comment-15335528
 ] 

Joerg Schad commented on MESOS-5588:


I agree that the second part is not a blocker (as it does not involve an API 
change), but I would not say that it is a low-priority wish.

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> During parsing of the authorizer, errors are ignored. This can lead to 
> undetected security issues.
> Consider the following acl with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags it will interpret the acl in the 
> following way, which gives any principal access to any framework.
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5630) Change build to always enable Nvidia GPU support for Linux

2016-06-17 Thread Kevin Klues (JIRA)
Kevin Klues created MESOS-5630:
--

 Summary: Change build to always enable Nvidia GPU support for Linux
 Key: MESOS-5630
 URL: https://issues.apache.org/jira/browse/MESOS-5630
 Project: Mesos
  Issue Type: Improvement
 Environment: Build / run unit tests in three build environments:
{noformat}
1) CentOS 7 on GPU capable machine
2) CentOS 7 on NON-GPU capable machine
3) OSX

$ rm -rf build; ./bootstrap; mkdir build; cd build; ../configure; make -j 
check; sudo GTEST_FILTER="*NVIDIA*" src/mesos-tests
{noformat}
Test support/docker_build.sh (to make sure we won't crash Apache's CI):
{noformat}
$ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent 
--enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=centos:7 
support/docker_build.sh

$ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent 
--enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=ubuntu:14.04 
support/docker_build.sh
{noformat}
Reporter: Kevin Klues
Assignee: Kevin Klues
 Fix For: 1.0.0


See Summary



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.

2016-06-17 Thread Alexander Rojas (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335506#comment-15335506
 ] 

Alexander Rojas commented on MESOS-5588:


# What you describe is not an ACLs problem; it affects every protobuf/json 
conversion in Mesos (see the sketch below), so we should probably open another 
Jira entry for that.
# I do not think the behavior you describe is a blocker, since it represents 
neither a regression nor a change in the API; the patch provided deals with the 
blocker part. But what you suggest sounds more like a low-priority wish.
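
For illustration, a standalone sketch of that generic protobuf/JSON failure 
mode (this uses protobuf's bundled {{google.protobuf.Api}} message and JSON 
util rather than the stout-based conversion Mesos actually performs; the 
"versoin" key is a deliberate typo): whether a mistyped key is rejected or 
silently dropped depends entirely on how strictly the parser is configured.

{code}
// Standalone illustration -- not Mesos code. Requires libprotobuf (>= 3.0).
#include <iostream>
#include <string>

#include <google/protobuf/api.pb.h>
#include <google/protobuf/util/json_util.h>

int main()
{
  // "versoin" is a deliberate typo for the "version" field of Api.
  const std::string json = R"({"name": "authz", "versoin": "1"})";

  google::protobuf::Api api;

  // Strict parsing: unknown JSON keys are rejected (the default options).
  google::protobuf::util::JsonParseOptions strict;
  auto status =
    google::protobuf::util::JsonStringToMessage(json, &api, strict);
  std::cout << "strict:  "
            << (status.ok() ? "ok" : "rejected unknown field") << std::endl;

  // Lenient parsing: the typo is silently dropped and `version` stays empty.
  google::protobuf::util::JsonParseOptions lenient;
  lenient.ignore_unknown_fields = true;
  status = google::protobuf::util::JsonStringToMessage(json, &api, lenient);
  std::cout << "lenient: " << (status.ok() ? "ok" : "error")
            << ", version='" << api.version() << "'" << std::endl;

  return 0;
}
{code}

With lenient parsing the mistyped key is dropped and the field simply reads 
back as its default, which mirrors how the mistyped "usr" ACL entry 
degenerated into an acl that matches everything.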

> Improve error handling when parsing acls.
> -
>
> Key: MESOS-5588
> URL: https://issues.apache.org/jira/browse/MESOS-5588
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>Priority: Blocker
>  Labels: mesosphere, security
> Fix For: 1.0.0
>
>
> During parsing of the authorizer, errors are ignored. This can lead to 
> undetected security issues.
> Consider the following acl with a typo (usr instead of user):
> {code}
>"view_frameworks": [
>   {
> "principals": { "type": "ANY" },
> "usr": { "type": "NONE" }
>   }
> ]
> {code}
> When the master is started with these flags it will interpret the acl in the 
> following way, which gives any principal access to any framework.
> {noformat}
> view_frameworks {
>   principals {
> type: ANY
>   }
> }
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)