[jira] [Commented] (MESOS-5188) docker executor thinks task is failed when docker container was stopped
[ https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337520#comment-15337520 ]

haosdent commented on MESOS-5188:
---------------------------------

Looks like this is not an issue in 1.0.0; let me remove the fix version. [~liqlin]

> docker executor thinks task is failed when docker container was stopped
> -----------------------------------------------------------------------
>
>                 Key: MESOS-5188
>                 URL: https://issues.apache.org/jira/browse/MESOS-5188
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 0.28.0
>            Reporter: Liqiang Lin
>
> Test case:
> 1. Launch a task with Swarm (on Mesos).
> {code}
> # docker -H 192.168.56.110:54375 run -d --cpu-shares 1 ubuntu sleep 300
> {code}
> 2. Then stop the docker container.
> {code}
> # docker -H 192.168.56.110:54375 ps
> CONTAINER ID   IMAGE    COMMAND       CREATED         STATUS         PORTS   NAMES
> b4813ba3ed4d   ubuntu   "sleep 300"   9 seconds ago   Up 8 seconds           mesos1/mesos-2cd5576e-6260-4262-a62c-b0dc45c86c45-S1.1595e79b-aef2-44b6-a313-ad4ff8626958
> # docker -H 192.168.56.110:54375 stop b4813ba3ed4d
> b4813ba3ed4d
> {code}
> 3. The task is then reported as failed. See the Mesos slave log:
> {code}
> I0407 09:10:57.606552 32307 slave.cpp:1508] Got assigned task 99ee7dc74861 for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.608230 32307 slave.cpp:1627] Launching task 99ee7dc74861 for framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:10:57.609979 32307 paths.cpp:528] Trying to chown '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9' to user 'root'
> I0407 09:10:57.615881 32307 slave.cpp:5586] Launching executor 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- with resources cpus(*):0.1; mem(*):32 in work directory '/var/lib/mesos/slaves/2cd5576e-6260-4262-a62c-b0dc45c86c45-S0/frameworks/5b84aad8-dd60-40b3-84c2-93be6b7aa81c-/executors/99ee7dc74861/runs/250a169f-7aba-474d-a4f5-cd24ecf0e7d9'
> I0407 09:12:18.458449 32307 slave.cpp:1845] Queuing task '99ee7dc74861' for executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c-
> I0407 09:12:18.459092 32307 slave.cpp:3711] No pings from master received within 75secs
> I0407 09:12:18.460212 32307 slave.cpp:4593] Current disk usage 56.53%. Max allowed age: 2.342613645432778days
> I0407 09:12:18.463484 32307 slave.cpp:928] Re-detecting master
> I0407 09:12:18.463969 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.464501 32307 slave.cpp:939] New master detected at master@192.168.56.110:5050
> I0407 09:12:18.464848 32307 slave.cpp:964] No credentials provided. Attempting to register without authentication
> I0407 09:12:18.465237 32307 slave.cpp:975] Detecting new master
> I0407 09:12:18.463611 32312 status_update_manager.cpp:174] Pausing sending status updates
> I0407 09:12:18.465744 32312 status_update_manager.cpp:174] Pausing sending status updates
> I0407 09:12:18.472323 32313 docker.cpp:1011] Starting container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' for task '99ee7dc74861' (and executor '99ee7dc74861') of framework '5b84aad8-dd60-40b3-84c2-93be6b7aa81c-'
> I0407 09:12:18.588739 32313 slave.cpp:1218] Re-registered with master master@192.168.56.110:5050
> I0407 09:12:18.588927 32313 slave.cpp:1254] Forwarding total oversubscribed resources
> I0407 09:12:18.589320 32313 slave.cpp:2395] Updating framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- pid to scheduler(1)@192.168.56.110:53375
> I0407 09:12:18.592079 32308 status_update_manager.cpp:181] Resuming sending status updates
> I0407 09:12:18.592842 32313 slave.cpp:2534] Updated checkpointed resources from to
> I0407 09:12:18.592793 32308 status_update_manager.cpp:181] Resuming sending status updates
> I0407 09:12:20.582041 32307 slave.cpp:2836] Got registration for executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from executor(1)@192.168.56.110:40725
> I0407 09:12:20.584446 32307 docker.cpp:1308] Ignoring updating container '250a169f-7aba-474d-a4f5-cd24ecf0e7d9' with resources passed to update is identical to existing resources
> I0407 09:12:20.585093 32307 slave.cpp:2010] Sending queued task '99ee7dc74861' to executor '99ee7dc74861' of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- at executor(1)@192.168.56.110:40725
> I0407 09:12:21.307077 32312 slave.cpp:3195] Handling status update TASK_RUNNING (UUID: a7098650-cbf6-4445-8216-b5f658d2f5f4) for task 99ee7dc74861 of framework 5b84aad8-dd60-40b3-84c2-93be6b7aa81c- from
> {code}
[jira] [Updated] (MESOS-5188) docker executor thinks task is failed when docker container was stopped
[ https://issues.apache.org/jira/browse/MESOS-5188?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

haosdent updated MESOS-5188:
----------------------------
    Fix Version/s:     (was: 1.0.0)

> docker executor thinks task is failed when docker container was stopped
> -----------------------------------------------------------------------
>
>                 Key: MESOS-5188
>                 URL: https://issues.apache.org/jira/browse/MESOS-5188
>             Project: Mesos
>          Issue Type: Bug
>          Components: docker
>    Affects Versions: 0.28.0
>            Reporter: Liqiang Lin
[jira] [Updated] (MESOS-5641) Update docker-volume.md to add some content for how to test
[ https://issues.apache.org/jira/browse/MESOS-5641?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jie Yu updated MESOS-5641:
--------------------------
       Issue Type: Task  (was: Bug)

> Update docker-volume.md to add some content for how to test
> -----------------------------------------------------------
>
>                 Key: MESOS-5641
>                 URL: https://issues.apache.org/jira/browse/MESOS-5641
>             Project: Mesos
>          Issue Type: Task
>            Reporter: Guangya Liu
>            Assignee: Guangya Liu
>             Fix For: 1.0.0
>
> mesos-execute was fixed in MESOS-5265; the document should be updated to show how to use mesos-execute to test the docker volume isolator.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Created] (MESOS-5641) Update docker-volume.md to add some content for how to test
Guangya Liu created MESOS-5641:
----------------------------------

             Summary: Update docker-volume.md to add some content for how to test
                 Key: MESOS-5641
                 URL: https://issues.apache.org/jira/browse/MESOS-5641
             Project: Mesos
          Issue Type: Bug
            Reporter: Guangya Liu
            Assignee: Guangya Liu

mesos-execute was fixed in MESOS-5265; the document should be updated to show how to use mesos-execute to test the docker volume isolator.
[jira] [Commented] (MESOS-5637) Authorized endpoint results are inconsistent for failures.
[ https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337419#comment-15337419 ]

Till Toenshoff commented on MESOS-5637:
---------------------------------------

To unify this, we need to decide on:
- the HTTP status code we actually want to show our users
- whether we want to display the future's error message in the HTTP body

Furthermore, we might want to introduce tests that prevent regressions from reintroducing such inconsistencies in the future.

> Authorized endpoint results are inconsistent for failures.
> ----------------------------------------------------------
>
>                 Key: MESOS-5637
>                 URL: https://issues.apache.org/jira/browse/MESOS-5637
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, modules
>    Affects Versions: 1.0.0
>            Reporter: Till Toenshoff
>              Labels: authorization, mesosphere, security
>
> When trying to access authorized endpoints, the resulting HTTP status codes are not consistent for internal authorizer failures (failed future returned by {{authorized}}).
> {{/flags}}:
> {noformat}
> HTTP/1.1 503 Service Unavailable
> Date: Fri, 17 Jun 2016 23:11:04 GMT
> Content-Length: 0
> {noformat}
> {{/state}}:
> {noformat}
> HTTP/1.1 500 Internal Server Error
> Date: Fri, 17 Jun 2016 23:08:49 GMT
> Content-Type: text/plain; charset=utf-8
> Content-Length: size($FUTURE_ERROR_MESSAGE)
> $FUTURE_ERROR_MESSAGE
> {noformat}
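To make the unification being discussed concrete: every endpoint's failed {{authorized}} future could be routed through one shared helper, so the status-code and body policy live in a single place and cannot drift per endpoint. A minimal self-contained C++ sketch of that idea (hypothetical names, not the actual Mesos endpoint code; it assumes the chosen policy is "500 Internal Server Error" with the future's error message echoed in the body):

```cpp
#include <cassert>
#include <string>

// Hypothetical response type standing in for process::http::Response.
struct HttpResponse {
  int status;
  std::string body;
};

// Single point of policy for internal authorizer failures: whichever
// status code and body format is chosen, every endpoint gets it from
// here. Here we assume 500 plus the failed future's error message.
HttpResponse authorizationFailure(const std::string& futureError) {
  return HttpResponse{500, futureError};
}
```

A regression test could then simply assert that each endpoint's failure response equals `authorizationFailure(...)` for the same error, which is the kind of consistency check the comment suggests adding.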
[jira] [Updated] (MESOS-5576) Masters may drop the first message they send between masters after a network partition
[ https://issues.apache.org/jira/browse/MESOS-5576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph Wu updated MESOS-5576:
-----------------------------
       Issue Type: Improvement  (was: Bug)

Changing type from {{Bug}} to {{Improvement}} because the masters will still recover *eventually* in this case. Bad sockets are cleaned out when the masters abort due to {{--registry_fetch_timeout}}.

> Masters may drop the first message they send between masters after a network partition
> --------------------------------------------------------------------------------------
>
>                 Key: MESOS-5576
>                 URL: https://issues.apache.org/jira/browse/MESOS-5576
>             Project: Mesos
>          Issue Type: Improvement
>          Components: leader election, master, replicated log
>    Affects Versions: 0.28.2
>         Environment: Observed in an OpenStack environment where each master lives on a separate VM.
>            Reporter: Joseph Wu
>            Assignee: Joseph Wu
>              Labels: mesosphere
>
> We observed the following situation in a cluster of five masters:
> || Time || Master 1 || Master 2 || Master 3 || Master 4 || Master 5 ||
> | 0 | Follower | Follower | Follower | Follower | Leader |
> | 1 | Follower | Follower | Follower | Follower || Partitioned from cluster by downing this VM's network ||
> | 2 || Elected Leader by ZK | Voting | Voting | Voting | Suicides due to lost leadership |
> | 3 | Performs consensus | Replies to leader | Replies to leader | Replies to leader | Still down |
> | 4 | Performs writing | Acks to leader | Acks to leader | Acks to leader | Still down |
> | 5 | Leader | Follower | Follower | Follower | Still down |
> | 6 | Leader | Follower | Follower | Follower | Comes back up |
> | 7 | Leader | Follower | Follower | Follower | Follower |
> | 8 || Partitioned in the same way as Master 5 | Follower | Follower | Follower | Follower |
> | 9 | Suicides due to lost leadership || Elected Leader by ZK | Follower | Follower | Follower |
> | 10 | Still down | Performs consensus | Replies to leader | Replies to leader || Doesn't get the message! ||
> | 11 | Still down | Performs writing | Acks to leader | Acks to leader || Acks to leader ||
> | 12 | Still down | Leader | Follower | Follower | Follower |
>
> Master 2 sends a series of messages to the recently-restarted Master 5. The first message is dropped, but subsequent messages are not.
> This appears to be due to a stale link between the masters. Before leader election, the replicated log actors create a network watcher, which adds links to masters that join the ZK group:
> https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/network.hpp#L157-L159
> This link does not appear to break (Master 2 -> 5) when Master 5 goes down, perhaps due to how the network partition was induced (in the hypervisor layer, rather than in the VM itself).
> When Master 2 tries to send a {{PromiseRequest}} to Master 5, we do not observe the [expected log message|https://github.com/apache/mesos/blob/7a23d0da817be4e8f68d96f524cecf802431033c/src/log/replica.cpp#L493-L494]. Instead, we see this log line on Master 2:
> {code}
> process.cpp:2040] Failed to shutdown socket with fd 27: Transport endpoint is not connected
> {code}
> The broken link is removed by the libprocess {{socket_manager}}, and the following {{WriteRequest}} from Master 2 to Master 5 succeeds via a new socket.
[jira] [Created] (MESOS-5640) Unify the help info for master/agent flags
Guangya Liu created MESOS-5640:
----------------------------------

             Summary: Unify the help info for master/agent flags
                 Key: MESOS-5640
                 URL: https://issues.apache.org/jira/browse/MESOS-5640
             Project: Mesos
          Issue Type: Bug
            Reporter: Guangya Liu
            Priority: Minor

Currently in master/flags.cpp, some flag help strings end with a "\n" while others do not, which makes the rendered output inconsistent:

{code}
  --[no-]hostname_lookup
      Whether we should execute a lookup to find out the server's hostname,
      if not explicitly set (via, e.g., `--hostname`). True by default; if
      set to `false` it will cause Mesos to use the IP address, unless the
      hostname is explicitly set. (default: true)

  --http_authenticators=VALUE
      HTTP authenticator implementation to use when handling requests to
      authenticated endpoints. Use the default `basic`, or load an alternate
      HTTP authenticator module using `--modules`. Currently there is no
      support for multiple HTTP authenticators. (default: basic)
  --http_framework_authenticators=VALUE
      HTTP authenticator implementation to use when authenticating HTTP
      frameworks. Use the `basic` authenticator or load an alternate
      authenticator module using `--modules`. Must be used in conjunction
      with `--http_authenticate_frameworks`.
{code}

I think we should follow the Linux "man command" format by adding "\n" to all flags. The following is sample output from "man ls":

{code}
     -@      Display extended attribute keys and sizes in long (-l) output.

     -1      (The numeric digit ``one''.)  Force output to be one entry per
             line.  This is the default when output is not to a terminal.

     -A      List all entries except for . and ..  Always set for the
             super-user.

     -a      Include directory entries whose names begin with a dot (.).

     -B      Force printing of non-printable characters (as defined by
             ctype(3) and current locale settings) in file names as \xxx,
             where xxx is the numeric value of the character in octal.

     -b      As -B, but use C escape codes whenever possible.
{code}
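Rather than auditing every flag's help string by hand, the consistency asked for above could be enforced mechanically by normalizing each string at registration or render time. A self-contained C++ sketch (hypothetical helper, not the actual flags code):

```cpp
#include <cassert>
#include <string>

// Normalize a flag's help text so it ends with exactly one trailing
// newline, regardless of whether the author wrote zero, one, or more.
std::string normalizeHelp(std::string help) {
  // Strip however many trailing newlines are present...
  while (!help.empty() && help.back() == '\n') {
    help.pop_back();
  }
  // ...then append exactly one, giving every flag the same spacing.
  return help + "\n";
}
```

Running every registered help string through such a helper would make the `--help` output uniform without touching each flag definition individually.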
[jira] [Created] (MESOS-5639) Add documentation about metadata for CNI plugins.
Jie Yu created MESOS-5639:
-----------------------------

             Summary: Add documentation about metadata for CNI plugins.
                 Key: MESOS-5639
                 URL: https://issues.apache.org/jira/browse/MESOS-5639
             Project: Mesos
          Issue Type: Task
            Reporter: Jie Yu
            Assignee: Jie Yu

We need to document the behavior implemented in MESOS-5592.
[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection
[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-5635:
-----------------------------
    Description:

This issue was observed recently on an internal test cluster. Due to a bug in the agent code (MESOS-5629), regular segfaults were occurring on an agent. After one such failure, the agent recovered, and about a minute later the following was observed in the master logs:
{code}
I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 (10.10.0.179)
{code}
However, we see nothing about registration in the agent logs at this time. Subsequently, in the master logs, we see the agent continuing to reregister every couple of seconds:
{code}
I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 (10.10.0.179)
{code}
After about four minutes of this, we see:
{code}
I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
{code}
And after this point, we see repeated reregistration attempts from that agent in the master logs:
{code}
W0617 22:29:09.514423  2010 master.cpp:4773] Agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 (10.10.0.179) attempted to re-register after removal;
{code}
During all of this, however, the agent logs indicate nothing about registration. All we see are requests coming in to {{/state}}:
{code}
Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326   875 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465   873 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
{code}
The lack of logging on the agent side, and the health check timeout, suggest a one-way disconnection such that the master cannot send messages to the agent, but the agent can send messages to the master. This behavior has been observed several times on this test cluster in the past couple of days. Full master and agent logs from the relevant time period have been attached.
[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection
[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-5635:
-----------------------------
    Attachment: master-log.txt
                agent-log.txt

> Agent repeatedly reregisters, possible one-way disconnection
> ------------------------------------------------------------
>
>                 Key: MESOS-5635
>                 URL: https://issues.apache.org/jira/browse/MESOS-5635
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Greg Mann
>              Labels: agent, mesosphere
>         Attachments: agent-log.txt, master-log.txt
[jira] [Created] (MESOS-5638) Check all omissions of 'defer' for safety
Greg Mann created MESOS-5638:
--------------------------------

             Summary: Check all omissions of 'defer' for safety
                 Key: MESOS-5638
                 URL: https://issues.apache.org/jira/browse/MESOS-5638
             Project: Mesos
          Issue Type: Bug
            Reporter: Greg Mann

When registering callbacks with {{.then}}, {{.onAny}}, etc., we sometimes omit {{defer()}} in cases where the callback is deemed thread-safe when run synchronously at an arbitrary callsite. Because of recent bugs due to the unsafe omission of {{defer()}}, we should do a sweep of the codebase for all such occurrences and evaluate their safety.
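For context on what {{defer()}} buys: a deferred callback is enqueued onto the owning actor's mailbox and runs serialized with the actor's other work, instead of executing immediately on whatever thread happens to complete the future. A minimal single-threaded model of that difference (illustrative sketch only, not libprocess itself; `Actor`, `defer`, and `runMailbox` are hypothetical names):

```cpp
#include <cassert>
#include <functional>
#include <queue>
#include <vector>

// Toy actor: a mailbox of callbacks plus a record of execution order.
struct Actor {
  std::queue<std::function<void()>> mailbox;
  std::vector<int> events;  // records the order in which work ran

  // Modeled defer(): enqueue the callback instead of invoking it at
  // the callsite, so it runs from the actor's own event loop.
  void defer(std::function<void()> f) { mailbox.push(std::move(f)); }

  void runMailbox() {
    while (!mailbox.empty()) {
      mailbox.front()();
      mailbox.pop();
    }
  }
};

// With defer(), the callback (event 2) runs only after the actor's
// in-progress work (event 1), never interleaved with it. Omitting
// defer() would run the callback immediately, yielding {2, 1}.
std::vector<int> orderedExecution() {
  Actor actor;
  actor.defer([&] { actor.events.push_back(2); });  // future "completes"
  actor.events.push_back(1);                        // actor's current work
  actor.runMailbox();
  return actor.events;
}
```

The sweep proposed in the ticket is essentially checking, for each callsite that omits the enqueue step, whether the "runs immediately on an arbitrary thread" ordering is actually safe.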
[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection
[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-5635:
-----------------------------
    Description:

This issue was observed recently on an internal test cluster. Due to a bug in the agent code (MESOS-5629), regular segfaults were occurring on an agent. After one such failure, the agent recovered, and about a minute later the following was observed in the master logs:
{code}
I0617 22:23:41.663557  2014 master.cpp:4795] Re-registering agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 (10.10.0.179)
{code}
However, we see nothing about registration in the agent logs at this time. Subsequently, in the master logs, we see the agent continuing to reregister every couple of seconds:
{code}
I0617 22:23:43.528590  2014 master.cpp:4795] Re-registering agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 (10.10.0.179)
{code}
After about four minutes of this, we see:
{code}
I0617 22:27:43.994493  2014 master.cpp:6750] Removed agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 (10.10.0.179): health check timed out
{code}
And after this point, we see repeated reregistration attempts from that agent in the master logs:
{code}
W0617 22:29:09.514423  2010 master.cpp:4773] Agent 6d4248cd-2832-4152-b5d0-defbf36f6759-S3 at slave(1)@10.10.0.179:5051 (10.10.0.179) attempted to re-register after removal;
{code}
During all of this, however, the agent logs indicate nothing about registration. All we see are requests coming in to {{/state}}:
{code}
Jun 17 22:26:37 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:37.870980   873 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38792 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.158476   879 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:38 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:38.884507   873 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:39 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:39.604486   876 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.018326   875 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.181:38803 with User-Agent='Mozilla/5.0 (Macintosh; Intel Mac OS X 10.10
Jun 17 22:26:40 ip-10-10-0-179 mesos-slave[868]: I0617 22:26:40.329465   873 http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.179:41009
{code}
The lack of logging on the agent side, and the health check timeout, suggest a one-way disconnection such that the master cannot send messages to the agent, but the agent can send messages to the master. This behavior has been observed several times on this test cluster in the past couple of days.

  was:

This issue was observed recently on an internal test cluster. Due to a bug in the agent code (MESOS-5629), regular segfaults were occurring on an agent. While the agent was recovering from one of these failures, it segfaulted again. After this time, we noticed that after beginning recovery, the agent did not print {{Finished recovery}}, and its logs did not show any indication of reregistering with the master. Looking at the master's logs, however, the following line was observed repeatedly, at intervals on the order of seconds:
{code}
W0617 21:27:07.010679  2016 master.cpp:4773] Agent 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 (10.10.0.87) attempted to re-register after removal; shutting it down
{code}
These re-registration attempts had no corresponding lines in the agent log. Subsequently deleting the contents of the agent's {{work_dir}} and restarting it led to a successful registration with a new agent ID:
{code}
I0617 21:29:01.246119  2011 master.cpp:4635] Registering agent at slave(1)@10.10.0.87:5051 (10.10.0.87) with id 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5
{code}
[jira] [Updated] (MESOS-5637) Authorized endpoint results are inconsistent for failures.
[ https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Till Toenshoff updated MESOS-5637:
----------------------------------
    Affects Version/s: 1.0.0

> Authorized endpoint results are inconsistent for failures.
> ----------------------------------------------------------
>
>                 Key: MESOS-5637
>                 URL: https://issues.apache.org/jira/browse/MESOS-5637
>             Project: Mesos
>          Issue Type: Bug
>          Components: master, modules
>    Affects Versions: 1.0.0
>            Reporter: Till Toenshoff
>              Labels: authorization, mesosphere, security
[jira] [Commented] (MESOS-5533) Agent fails to start on CentOS 6 due to missing cgroup hiearchy
[ https://issues.apache.org/jira/browse/MESOS-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337329#comment-15337329 ] Gilbert Song commented on MESOS-5533: - [~avin...@mesosphere.io], I guess we have some info mismatch, my bad. I have patches for the test failures below on centos 6: `CniIsolatorTest.ROOT_INTERNET_CURL_LaunchCommandTask` `CniIsolatorTest.ROOT_VerifyCheckpointedInfo` `CniIsolatorTest.ROOT_SlaveRecovery` But this should be a diff issue. Seems like it is just a check. Should be a quick fix. Do you want to take over? Or I can do that. > Agent fails to start on CentOS 6 due to missing cgroup hiearchy > --- > > Key: MESOS-5533 > URL: https://issues.apache.org/jira/browse/MESOS-5533 > Project: Mesos > Issue Type: Bug > Components: build, isolation >Reporter: Kapil Arya >Assignee: Gilbert Song >Priority: Blocker > Labels: mesosphere > Fix For: 1.0.0 > > > With the network CNI isolator, agent now _requires_ cgroups to be installed > on the system. Can we add some check(s) to either automatically disable CNI > module if cgroup hierarchies are not available or ask the user to > install/enable cgroup hierarchies. > On CentOS 6, cgroup tools aren't installed by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5637) Authorized endpoint results are inconsistent for failures.
[ https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-5637: --- Priority: Major (was: Minor) > Authorized endpoint results are inconsistent for failures. > -- > > Key: MESOS-5637 > URL: https://issues.apache.org/jira/browse/MESOS-5637 > Project: Mesos > Issue Type: Bug > Components: master, modules >Reporter: Till Toenshoff > Labels: authorization, mesosphere, security > > When trying to access authorized endpoints, the resulting HTTP status codes > are not consistent for internal authorizer failures (failed future returned > by {{authorized}}). > {{/flags}}: > {noformat} > HTTP/1.1 503 Service Unavailable > Date: Fri, 17 Jun 2016 23:11:04 GMT > Content-Length: 0 > {noformat} > {{/state}}: > {noformat} > HTTP/1.1 500 Internal Server Error > Date: Fri, 17 Jun 2016 23:08:49 GMT > Content-Type: text/plain; charset=utf-8 > Content-Length: size($FUTURE_ERROR_MESSAGE) > $FUTURE_ERROR_MESSAGE > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5637) Authorized endpoint results are inconsistent for failures.
[ https://issues.apache.org/jira/browse/MESOS-5637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rukletsov updated MESOS-5637: --- Labels: authorization mesosphere security (was: authorization security) > Authorized endpoint results are inconsistent for failures. > -- > > Key: MESOS-5637 > URL: https://issues.apache.org/jira/browse/MESOS-5637 > Project: Mesos > Issue Type: Bug > Components: master, modules >Reporter: Till Toenshoff >Priority: Minor > Labels: authorization, mesosphere, security > > When trying to access authorized endpoints, the resulting HTTP status codes > are not consistent for internal authorizer failures (failed future returned > by {{authorized}}). > {{/flags}}: > {noformat} > HTTP/1.1 503 Service Unavailable > Date: Fri, 17 Jun 2016 23:11:04 GMT > Content-Length: 0 > {noformat} > {{/state}}: > {noformat} > HTTP/1.1 500 Internal Server Error > Date: Fri, 17 Jun 2016 23:08:49 GMT > Content-Type: text/plain; charset=utf-8 > Content-Length: size($FUTURE_ERROR_MESSAGE) > $FUTURE_ERROR_MESSAGE > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5637) Authorized endpoint results are inconsistent for failures.
Till Toenshoff created MESOS-5637: - Summary: Authorized endpoint results are inconsistent for failures. Key: MESOS-5637 URL: https://issues.apache.org/jira/browse/MESOS-5637 Project: Mesos Issue Type: Bug Components: master, modules Reporter: Till Toenshoff Priority: Minor When trying to access authorized endpoints, the resulting HTTP status codes are not consistent for internal authorizer failures (failed future returned by {{authorized}}). {{/flags}}: {noformat} HTTP/1.1 503 Service Unavailable Date: Fri, 17 Jun 2016 23:11:04 GMT Content-Length: 0 {noformat} {{/state}}: {noformat} HTTP/1.1 500 Internal Server Error Date: Fri, 17 Jun 2016 23:08:49 GMT Content-Type: text/plain; charset=utf-8 Content-Length: size($FUTURE_ERROR_MESSAGE) $FUTURE_ERROR_MESSAGE {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
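The inconsistency described in this ticket (503 from {{/flags}} vs. 500 from {{/state}} for the same authorizer failure) can be illustrated with a minimal sketch. The helper name and result encoding below are hypothetical, not Mesos code; the point is routing all failed authorizer futures through one place so every endpoint returns the same status code.

```python
# Sketch only: hypothetical helper, not the Mesos implementation.
# A failed authorizer future maps to one consistent status code,
# instead of 503 on some endpoints and 500 on others.

def respond(authz_result):
    """authz_result: ('failed', msg) for an internal authorizer error,
    or ('ok', allowed) for a completed authorization check."""
    state, value = authz_result
    if state == 'failed':
        # One consistent choice for internal authorizer failures.
        return (500, value)
    if not value:
        return (403, 'Forbidden')
    return (200, 'OK')
```

With a shared helper like this, whether the body carries the future's error message is also decided in exactly one place.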
[jira] [Updated] (MESOS-5592) Pass NetworkInfo to CNI Plugins
[ https://issues.apache.org/jira/browse/MESOS-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-5592: -- Sprint: Mesosphere Sprint 37 Story Points: 3 Labels: mesosphere (was: ) > Pass NetworkInfo to CNI Plugins > --- > > Key: MESOS-5592 > URL: https://issues.apache.org/jira/browse/MESOS-5592 > Project: Mesos > Issue Type: Improvement >Reporter: Dan Osborne >Assignee: Dan Osborne > Labels: mesosphere > Fix For: 1.0.0 > > > Mesos has adopted the Container Network Interface as a simple means of > networking Mesos tasks launched by the Unified Containerizer. The CNI > specification covers a minimum feature set, granting the flexibility to add > customized networking functionality in the form of agreements made between > the orchestrator and CNI plugin. > This proposal is to pass NetworkInfo.Labels to the CNI plugin by injecting it > into the CNI network configuration json during plugin invocation. > Design Doc on this change: > https://docs.google.com/document/d/1rxruCCcJqpppsQxQrzTbHFVnnW6CgQ2oTieYAmwL284/edit?usp=sharing > reviewboard: https://reviews.apache.org/r/48527/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
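The proposal above injects NetworkInfo labels into the CNI network configuration JSON at plugin invocation time. A minimal sketch of that injection step follows; the nested key layout under {{args}} is illustrative (CNI reserves "args" for orchestrator-supplied data), not the committed format from the design doc.

```python
import json

def inject_network_info(cni_config, labels):
    """Sketch: merge NetworkInfo labels into a CNI network config
    before the plugin is invoked. The key names under "args" are
    illustrative assumptions, not the format Mesos settled on."""
    config = json.loads(cni_config)
    config.setdefault("args", {})["org.apache.mesos"] = {
        "network_info": {"labels": labels}
    }
    return json.dumps(config)

conf = '{"name": "net1", "type": "bridge"}'
augmented = json.loads(
    inject_network_info(conf, [{"key": "rack", "value": "a1"}]))
```

The original config keys ({{name}}, {{type}}) pass through untouched, so a plugin that ignores {{args}} behaves exactly as before.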
[jira] [Commented] (MESOS-5533) Agent fails to start on CentOS 6 due to missing cgroup hiearchy
[ https://issues.apache.org/jira/browse/MESOS-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337273#comment-15337273 ] Avinash Sridharan commented on MESOS-5533: -- I think [~gilbert] had a patch, not sure it went up for review? > Agent fails to start on CentOS 6 due to missing cgroup hiearchy > --- > > Key: MESOS-5533 > URL: https://issues.apache.org/jira/browse/MESOS-5533 > Project: Mesos > Issue Type: Bug > Components: build, isolation >Reporter: Kapil Arya >Assignee: Gilbert Song >Priority: Blocker > Labels: mesosphere > Fix For: 1.0.0 > > > With the network CNI isolator, agent now _requires_ cgroups to be installed > on the system. Can we add some check(s) to either automatically disable CNI > module if cgroup hierarchies are not available or ask the user to > install/enable cgroup hierarchies. > On CentOS 6, cgroup tools aren't installed by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5533) Agent fails to start on CentOS 6 due to missing cgroup hiearchy
[ https://issues.apache.org/jira/browse/MESOS-5533?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337255#comment-15337255 ] Vinod Kone commented on MESOS-5533: --- What's the status of this? > Agent fails to start on CentOS 6 due to missing cgroup hiearchy > --- > > Key: MESOS-5533 > URL: https://issues.apache.org/jira/browse/MESOS-5533 > Project: Mesos > Issue Type: Bug > Components: build, isolation >Reporter: Kapil Arya >Assignee: Gilbert Song >Priority: Blocker > Labels: mesosphere > Fix For: 1.0.0 > > > With the network CNI isolator, agent now _requires_ cgroups to be installed > on the system. Can we add some check(s) to either automatically disable CNI > module if cgroup hierarchies are not available or ask the user to > install/enable cgroup hierarchies. > On CentOS 6, cgroup tools aren't installed by default. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way disconnection
[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-5635: - Summary: Agent repeatedly reregisters, possible one-way disconnection (was: Agent repeatedly reregisters, possible one-way partition) > Agent repeatedly reregisters, possible one-way disconnection > > > Key: MESOS-5635 > URL: https://issues.apache.org/jira/browse/MESOS-5635 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann > Labels: agent, mesosphere > > This issue was observed recently on an internal test cluster. Due to a bug in > the agent code (MESOS-5629), regular segfaults were occurring on an agent. > While the agent was recovering from one of these failures, it segfaulted > again. After this time, we noticed that after beginning recovery, the agent > did not print {{Finished recovery}}, and its logs did not show any indication > of reregistering with the master. Looking at the master's logs, however, the > following line was observed repeatedly, at intervals on the order of seconds: > {code} > W0617 21:27:07.010679 2016 master.cpp:4773] Agent > 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 > (10.10.0.87) attempted to re-register after removal; shutting it down > {code} > These re-registration attempts had no corresponding lines in the agent log. > Subsequently deleting the contents of the agent's {{work_dir}} and restarting > it led to a successful registration with a new agent ID: > {code} > I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at > slave(1)@10.10.0.87:5051 (10.10.0.87) with id > 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5635) Agent repeatedly reregisters, possible one-way partition
[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-5635: - Summary: Agent repeatedly reregisters, possible one-way partition (was: Agent failure during recovery prevents reregistration) > Agent repeatedly reregisters, possible one-way partition > > > Key: MESOS-5635 > URL: https://issues.apache.org/jira/browse/MESOS-5635 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann > Labels: agent, mesosphere > > This issue was observed recently on an internal test cluster. Due to a bug in > the agent code (MESOS-5629), regular segfaults were occurring on an agent. > While the agent was recovering from one of these failures, it segfaulted > again. After this time, we noticed that after beginning recovery, the agent > did not print {{Finished recovery}}, and its logs did not show any indication > of reregistering with the master. Looking at the master's logs, however, the > following line was observed repeatedly, at intervals on the order of seconds: > {code} > W0617 21:27:07.010679 2016 master.cpp:4773] Agent > 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 > (10.10.0.87) attempted to re-register after removal; shutting it down > {code} > These re-registration attempts had no corresponding lines in the agent log. > Subsequently deleting the contents of the agent's {{work_dir}} and restarting > it led to a successful registration with a new agent ID: > {code} > I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at > slave(1)@10.10.0.87:5051 (10.10.0.87) with id > 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5636) Display allocated resources in the agent listing of the webui.
[ https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-5636: --- Description: State endpoint returns information about allocated resources for each agent. We can present this information in the agent listing. (was: State endpoint returns information about slaves used resources. Present this data in agents page.) > Display allocated resources in the agent listing of the webui. > -- > > Key: MESOS-5636 > URL: https://issues.apache.org/jira/browse/MESOS-5636 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Tomasz Janiszewski >Assignee: Tomasz Janiszewski >Priority: Trivial > Fix For: 1.0.0 > > Attachments: mesos_agents_webui.png > > > State endpoint returns information about allocated resources for each agent. > We can present this information in the agent listing. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5636) Display allocated resources in the agent listing of the webui.
[ https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-5636: --- Summary: Display allocated resources in the agent listing of the webui. (was: Support displaying allocated resources of Agents in Mesos webui) > Display allocated resources in the agent listing of the webui. > -- > > Key: MESOS-5636 > URL: https://issues.apache.org/jira/browse/MESOS-5636 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Tomasz Janiszewski >Assignee: Tomasz Janiszewski >Priority: Trivial > Fix For: 1.0.0 > > Attachments: mesos_agents_webui.png > > > State endpoint returns information about slaves used resources. Present this > data in agents page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5636) Display allocated resources in the agent listing of the webui.
[ https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler updated MESOS-5636: --- Shepherd: Benjamin Mahler > Display allocated resources in the agent listing of the webui. > -- > > Key: MESOS-5636 > URL: https://issues.apache.org/jira/browse/MESOS-5636 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Tomasz Janiszewski >Assignee: Tomasz Janiszewski >Priority: Trivial > Fix For: 1.0.0 > > Attachments: mesos_agents_webui.png > > > State endpoint returns information about slaves used resources. Present this > data in agents page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5636) Support displaying allocated resources of Agents in Mesos webui
[ https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Janiszewski updated MESOS-5636: -- Summary: Support displaying allocated resources of Agents in Mesos webui (was: Support displaying used resources of Agents in Mesos webui) > Support displaying allocated resources of Agents in Mesos webui > --- > > Key: MESOS-5636 > URL: https://issues.apache.org/jira/browse/MESOS-5636 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Tomasz Janiszewski >Assignee: Tomasz Janiszewski >Priority: Trivial > Attachments: mesos_agents_webui.png > > > State endpoint returns information about slaves used resources. Present this > data in agents page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5636) Support displaying used resources of Agents in Mesos webui
[ https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Janiszewski updated MESOS-5636: -- Attachment: mesos_agents_webui.png > Support displaying used resources of Agents in Mesos webui > -- > > Key: MESOS-5636 > URL: https://issues.apache.org/jira/browse/MESOS-5636 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Tomasz Janiszewski >Priority: Trivial > Attachments: mesos_agents_webui.png > > > State endpoint returns information about slaves used resources. Present this > data in agents page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5633) User related shell environment is not set correctly in tasks
[ https://issues.apache.org/jira/browse/MESOS-5633?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15337097#comment-15337097 ] Jie Yu commented on MESOS-5633: --- Remember that a Mesos task should always write to its own sandbox ($MESOS_SANDBOX). I am wondering if it makes sense to set $HOME to $MESOS_SANDBOX. I am not sure if it'll break something. Is there a standard specifying how $HOME should be set? > User related shell environment is not set correctly in tasks > > > Key: MESOS-5633 > URL: https://issues.apache.org/jira/browse/MESOS-5633 > Project: Mesos > Issue Type: Bug >Reporter: haosdent > > If a user specifies the user field in {{FrameworkInfo}} or {{Task}}, both > {{setuid}} and {{setgroups}} are set correctly. However, some user-related > shell variables, e.g., {{HOME}} and {{USER}}, still refer to root. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
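A small sketch of the fix being discussed: alongside {{setuid}}/{{setgroups}}, the launcher would also export user-related variables. Pointing $HOME at the sandbox is the option floated in the comment above; the fallback to the passwd entry's home directory is an assumption for illustration, not established Mesos behavior.

```python
import pwd

def task_environment(username, sandbox):
    """Sketch of the user-related env vars a containerizer could set
    when launching a task as `username`. HOME -> sandbox is the idea
    from the ticket; the passwd fallback is an illustrative assumption."""
    entry = pwd.getpwnam(username)
    return {
        "USER": username,
        "LOGNAME": username,
        "HOME": sandbox or entry.pw_dir,
    }
```

This keeps the environment consistent with the effective uid, so tools that consult $USER or $HOME (shells, git, pip caches) stop resolving to root.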
[jira] [Updated] (MESOS-5265) Update mesos-execute to support docker volume isolator.
[ https://issues.apache.org/jira/browse/MESOS-5265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-5265: -- Sprint: Mesosphere Sprint 37 Story Points: 3 > Update mesos-execute to support docker volume isolator. > --- > > Key: MESOS-5265 > URL: https://issues.apache.org/jira/browse/MESOS-5265 > Project: Mesos > Issue Type: Bug >Reporter: Guangya Liu >Assignee: Guangya Liu > > The mesos-execute needs to be updated to support docker volume isolator. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5635) Agent failure during recovery prevents reregistration
[ https://issues.apache.org/jira/browse/MESOS-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-5635: -- Description: This issue was observed recently on an internal test cluster. Due to a bug in the agent code (MESOS-5629), regular segfaults were occurring on an agent. While the agent was recovering from one of these failures, it segfaulted again. After this time, we noticed that after beginning recovery, the agent did not print {{Finished recovery}}, and its logs did not show any indication of reregistering with the master. Looking at the master's logs, however, the following line was observed repeatedly, at intervals on the order of seconds: {code} W0617 21:27:07.010679 2016 master.cpp:4773] Agent 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 (10.10.0.87) attempted to re-register after removal; shutting it down {code} These re-registration attempts had no corresponding lines in the agent log. Subsequently deleting the contents of the agent's {{work_dir}} and restarting it led to a successful registration with a new agent ID: {code} I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at slave(1)@10.10.0.87:5051 (10.10.0.87) with id 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5 {code} was: This issue was observed recently on an internal test cluster. Due to a bug in the agent code (MESOS-5629), regular segfaults were occurring on an agent. While the agent was recovering from one of these failures, it segfaulted again. After this time, we noticed that after recovery, the agent did not print {{Finished recovery}}, and its logs did not show any indication of reregistering with the master. 
Looking at the master's logs, however, the following line was observed repeatedly, at intervals on the order of seconds: {code} W0617 21:27:07.010679 2016 master.cpp:4773] Agent 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 (10.10.0.87) attempted to re-register after removal; shutting it down {code} These re-registration attempts had no corresponding lines in the agent log. Subsequently deleting the contents of the agent's {{work_dir}} and restarting it led to a successful registration with a new agent ID: {code} I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at slave(1)@10.10.0.87:5051 (10.10.0.87) with id 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5 {code} > Agent failure during recovery prevents reregistration > - > > Key: MESOS-5635 > URL: https://issues.apache.org/jira/browse/MESOS-5635 > Project: Mesos > Issue Type: Bug >Reporter: Greg Mann > Labels: agent, mesosphere > > This issue was observed recently on an internal test cluster. Due to a bug in > the agent code (MESOS-5629), regular segfaults were occurring on an agent. > While the agent was recovering from one of these failures, it segfaulted > again. After this time, we noticed that after beginning recovery, the agent > did not print {{Finished recovery}}, and its logs did not show any indication > of reregistering with the master. Looking at the master's logs, however, the > following line was observed repeatedly, at intervals on the order of seconds: > {code} > W0617 21:27:07.010679 2016 master.cpp:4773] Agent > 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 > (10.10.0.87) attempted to re-register after removal; shutting it down > {code} > These re-registration attempts had no corresponding lines in the agent log. 
> Subsequently deleting the contents of the agent's {{work_dir}} and restarting > it led to a successful registration with a new agent ID: > {code} > I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at > slave(1)@10.10.0.87:5051 (10.10.0.87) with id > 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5 > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5636) Support displaying used resources of Agents in Mesos webui
[ https://issues.apache.org/jira/browse/MESOS-5636?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tomasz Janiszewski updated MESOS-5636: -- Attachment: mesos_agents_webui.png > Support displaying used resources of Agents in Mesos webui > -- > > Key: MESOS-5636 > URL: https://issues.apache.org/jira/browse/MESOS-5636 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Tomasz Janiszewski >Priority: Trivial > Attachments: mesos_agents_webui.png > > > State endpoint returns information about slaves used resources. Present this > data in agents page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5636) Support displaying used resources of Agents in Mesos webui
Tomasz Janiszewski created MESOS-5636: - Summary: Support displaying used resources of Agents in Mesos webui Key: MESOS-5636 URL: https://issues.apache.org/jira/browse/MESOS-5636 Project: Mesos Issue Type: Improvement Components: webui Reporter: Tomasz Janiszewski Priority: Trivial State endpoint returns information about slaves used resources. Present this data in agents page. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5635) Agent failure during recovery prevents reregistration
Greg Mann created MESOS-5635: Summary: Agent failure during recovery prevents reregistration Key: MESOS-5635 URL: https://issues.apache.org/jira/browse/MESOS-5635 Project: Mesos Issue Type: Bug Reporter: Greg Mann This issue was observed recently on an internal test cluster. Due to a bug in the agent code (MESOS-5629), regular segfaults were occurring on an agent. While the agent was recovering from one of these failures, it segfaulted again. After this time, we noticed that after recovery, the agent did not print {{Finished recovery}}, and its logs did not show any indication of reregistering with the master. Looking at the master's logs, however, the following line was observed repeatedly, at intervals on the order of seconds: {code} W0617 21:27:07.010679 2016 master.cpp:4773] Agent 2b899dd3-3b1f-4520-a6b2-98e32196f723-S4 at slave(1)@10.10.0.87:5051 (10.10.0.87) attempted to re-register after removal; shutting it down {code} These re-registration attempts had no corresponding lines in the agent log. Subsequently deleting the contents of the agent's {{work_dir}} and restarting it led to a successful registration with a new agent ID: {code} I0617 21:29:01.246119 2011 master.cpp:4635] Registering agent at slave(1)@10.10.0.87:5051 (10.10.0.87) with id 2b899dd3-3b1f-4520-a6b2-98e32196f723-S5 {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output
[ https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336993#comment-15336993 ] Mallik Singaraju commented on MESOS-4087: - ok, thanks Joseph > Introduce a module for logging executor/task output > --- > > Key: MESOS-4087 > URL: https://issues.apache.org/jira/browse/MESOS-4087 > Project: Mesos > Issue Type: Task > Components: containerization, modules >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: logging, mesosphere > Fix For: 0.27.0 > > > Existing executor/task logs are logged to files in their sandbox directory, > with some nuances based on which containerizer is used (see background > section in linked document). > A logger for executor/task logs has the following requirements: > * The logger is given a command to run and must handle the stdout/stderr of > the command. > * The handling of stdout/stderr must be resilient across agent failover. > Logging should not stop if the agent fails. > * Logs should be readable, presumably via the web UI, or via some other > module-specific UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5630) Change build to always enable Nvidia GPU support for Linux
[ https://issues.apache.org/jira/browse/MESOS-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336983#comment-15336983 ] Benjamin Mahler commented on MESOS-5630: {noformat} commit da610431162e738615a59cb04fb69766b9a847d5 Author: Kevin Klues Date: Fri Jun 17 14:17:07 2016 -0700 Fixed Cmake build for Nvidia GPU support on Linux. Review: https://reviews.apache.org/r/48881/ {noformat} {noformat} commit 1f65937ba38eca54247447ceafd6ccdd93163cdc Author: Kevin Klues Date: Fri Jun 17 14:17:15 2016 -0700 Fixed Cmake build for Nvidia GPU support on Linux in stout. Review: https://reviews.apache.org/r/48882/ {noformat} {noformat} commit d2d5c409f51f689f523137b502f553225d3474ae Author: Kevin Klues Date: Fri Jun 17 14:17:20 2016 -0700 Fixed Cmake build for Nvidia GPU support on Linux in libprocess. Review: https://reviews.apache.org/r/48883/ {noformat} > Change build to always enable Nvidia GPU support for Linux > -- > > Key: MESOS-5630 > URL: https://issues.apache.org/jira/browse/MESOS-5630 > Project: Mesos > Issue Type: Improvement > Environment: Build / run unit tests in three build environments: > {noformat} > 1) CentOS 7 on GPU capable machine > 2) CentOS 7 on NON-GPU capable machine > 3) OSX > $ rm -rf build; ./bootstrap; mkdir build; cd build; ../configure; make -j > check; sudo GTEST_FILTER="*NVIDIA*" src/mesos-tests > {noformat} > Test support/build_docker.sh (to make sure we won't crash Apache's CI): > {noformat} > $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent > --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=centos:7 > support/docker_build.sh > $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent > --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=ubuntu:14.04 > support/docker_build.sh > {noformat} >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: gpu, mesosphere > Fix For: 1.0.0 > > > See Summary -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-5516) Implement GET_STATE Call in v1 agent API.
[ https://issues.apache.org/jira/browse/MESOS-5516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone reassigned MESOS-5516: - Assignee: (was: Vinod Kone) > Implement GET_STATE Call in v1 agent API. > - > > Key: MESOS-5516 > URL: https://issues.apache.org/jira/browse/MESOS-5516 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5592) Pass NetworkInfo to CNI Plugins
[ https://issues.apache.org/jira/browse/MESOS-5592?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-5592: -- Assignee: Dan Osborne > Pass NetworkInfo to CNI Plugins > --- > > Key: MESOS-5592 > URL: https://issues.apache.org/jira/browse/MESOS-5592 > Project: Mesos > Issue Type: Improvement >Reporter: Dan Osborne >Assignee: Dan Osborne > Fix For: 1.0.0 > > > Mesos has adopted the Container Network Interface as a simple means of > networking Mesos tasks launched by the Unified Containerizer. The CNI > specification covers a minimum feature set, granting the flexibility to add > customized networking functionality in the form of agreements made between > the orchestrator and CNI plugin. > This proposal is to pass NetworkInfo.Labels to the CNI plugin by injecting it > into the CNI network configuration json during plugin invocation. > Design Doc on this change: > https://docs.google.com/document/d/1rxruCCcJqpppsQxQrzTbHFVnnW6CgQ2oTieYAmwL284/edit?usp=sharing > reviewboard: https://reviews.apache.org/r/48527/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5634) Add Framework Capability for GPU_RESOURCES
Kevin Klues created MESOS-5634: -- Summary: Add Framework Capability for GPU_RESOURCES Key: MESOS-5634 URL: https://issues.apache.org/jira/browse/MESOS-5634 Project: Mesos Issue Type: Task Reporter: Kevin Klues Assignee: Kevin Klues Fix For: 1.0.0 Due to the scarce resource problem described in MESOS-5377, we plan to introduce a GPU_RESOURCES Framework capability. This capability will allow the Mesos allocator to make better decisions about which frameworks should receive resources from GPU capable machines. In essence, the allocator will ONLY allocate resources from GPU capable machines to frameworks that have this capability. This is necessary to prevent non-GPU workloads from filling up the GPU machines and preventing GPU workloads from running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
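The allocation rule described in this ticket can be sketched in a few lines. This is an illustration of the stated policy only, not the Mesos allocator: on a GPU-capable agent, only frameworks advertising the capability are eligible for offers; every other agent is offered to all frameworks.

```python
def offerable_frameworks(agent_has_gpu, frameworks):
    """Sketch of the GPU_RESOURCES gating policy. `frameworks` is a
    list of (name, capabilities) pairs; capability strings are taken
    from the ticket, the function itself is illustrative."""
    if not agent_has_gpu:
        # Non-GPU agents: everyone is eligible.
        return [name for name, _ in frameworks]
    # GPU-capable agents: only capability-advertising frameworks,
    # so non-GPU workloads cannot fill up the GPU machines.
    return [name for name, caps in frameworks if "GPU_RESOURCES" in caps]
```

The trade-off is that GPU agents may sit idle when no GPU-capable framework is registered, which is accepted to keep GPUs available for the workloads that need them.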
[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output
[ https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336709#comment-15336709 ] Joseph Wu commented on MESOS-4087: -- Sounds like you're trying to build a custom solution for your specific framework. You might want to ask in the Spark community on how they've done logging. The {{ContainerLogger}} (this JIRA) is meant to encompass the stdout/stderr of *any* executor, and involves loading a module into your agents. If you are willing to dip into C++, you can write your own appender/forwarder. Examples: https://github.com/apache/mesos/tree/master/src/slave/container_loggers > Introduce a module for logging executor/task output > --- > > Key: MESOS-4087 > URL: https://issues.apache.org/jira/browse/MESOS-4087 > Project: Mesos > Issue Type: Task > Components: containerization, modules >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: logging, mesosphere > Fix For: 0.27.0 > > > Existing executor/task logs are logged to files in their sandbox directory, > with some nuances based on which containerizer is used (see background > section in linked document). > A logger for executor/task logs has the following requirements: > * The logger is given a command to run and must handle the stdout/stderr of > the command. > * The handling of stdout/stderr must be resilient across agent failover. > Logging should not stop if the agent fails. > * Logs should be readable, presumably via the web UI, or via some other > module-specific UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5400) Add preliminary support for parsing ELF files in stout.
[ https://issues.apache.org/jira/browse/MESOS-5400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336700#comment-15336700 ] Kevin Klues commented on MESOS-5400: {noformat} commit 7c0f57ff0ecb2b0e3e2cfe5eeca80e53d791c2d3 Author: Kevin Klues klue...@gmail.com Date: Fri Jun 17 01:32:29 2016 -0400 Added missing `stout/elf.hpp` file to `nobase_include_HEADERS`. Without this, files that #included `stout/elf.hpp` would fail with a `make distcheck` because this file was not being installed properly from a `make install`. Review: https://reviews.apache.org/r/48838/ {noformat} > Add preliminary support for parsing ELF files in stout. > --- > > Key: MESOS-5400 > URL: https://issues.apache.org/jira/browse/MESOS-5400 > Project: Mesos > Issue Type: Improvement >Reporter: Kevin Klues >Assignee: Kevin Klues >Priority: Minor > Fix For: 1.0.0 > > > The upcoming Nvidia GPU support for docker containers in Mesos relies on > consolidating all Nvidia shared libraries into a common location for > injecting a volume into a container. > As part of this, we need some preliminary parsing capabilities for ELF file > to infer things about each shared library we are consolidating. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5630) Change build to always enable Nvidia GPU support for Linux
[ https://issues.apache.org/jira/browse/MESOS-5630?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336672#comment-15336672 ] Kevin Klues commented on MESOS-5630: https://reviews.apache.org/r/48832/ > Change build to always enable Nvidia GPU support for Linux > -- > > Key: MESOS-5630 > URL: https://issues.apache.org/jira/browse/MESOS-5630 > Project: Mesos > Issue Type: Improvement > Environment: Build / run unit tests in three build environments: > {noformat} > 1) CentOS 7 on GPU capable machine > 2) CentOS 7 on NON-GPU capable machine > 3) OSX > $ rm -rf build; ./bootstrap; mkdir build; cd build; ../configure; make -j > check; sudo GTEST_FILTER="*NVIDIA*" src/mesos-tests > {noformat} > Test support/build_docker.sh (to make sure we won't crash Apache's CI): > {noformat} > $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent > --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=centos:7 > support/docker_build.sh > $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent > --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=ubuntu:14.04 > support/docker_build.sh > {noformat} >Reporter: Kevin Klues >Assignee: Kevin Klues > Labels: gpu, mesosphere > Fix For: 1.0.0 > > > See Summary -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4248) mesos slave can't start in CentOS-7 docker container
[ https://issues.apache.org/jira/browse/MESOS-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336624#comment-15336624 ] Justin Venus commented on MESOS-4248: - Yes, that is exactly what I want. Thank you for pointing out the ticket so I didn't have to go Jira spelunking today. > mesos slave can't start in CentOS-7 docker container > > > Key: MESOS-4248 > URL: https://issues.apache.org/jira/browse/MESOS-4248 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.26.0 > Environment: My host OS is Debian Jessie, the container OS is CentOS > 7.2. > {code} > # cat /etc/system-release > CentOS Linux release 7.2.1511 (Core) > # rpm -qa |grep mesos > mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64 > mesosphere-el-repo-7-1.noarch > mesos-0.26.0-0.2.145.centos701406.x86_64 > $ docker version > Client: > Version: 1.9.1 > API version: 1.21 > Go version: go1.4.2 > Git commit: a34a1d5 > Built:Fri Nov 20 12:59:02 UTC 2015 > OS/Arch: linux/amd64 > Server: > Version: 1.9.1 > API version: 1.21 > Go version: go1.4.2 > Git commit: a34a1d5 > Built:Fri Nov 20 12:59:02 UTC 2015 > OS/Arch: linux/amd64 > {code} >Reporter: Yubao Liu > > // Check the "Environment" label above for kinds of software versions. > "systemctl start mesos-slave" can't start mesos-slave: > {code} > # journalctl -u mesos-slave > > Dec 24 10:35:25 mesos-slave1 systemd[1]: Started Mesos Slave. > Dec 24 10:35:25 mesos-slave1 systemd[1]: Starting Mesos Slave... > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210180 12838 > logging.cpp:172] INFO level logging started! 
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210603 12838 > main.cpp:190] Build: 2015-12-16 23:06:16 by root > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210625 12838 > main.cpp:192] Version: 0.26.0 > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210634 12838 > main.cpp:195] Git tag: 0.26.0 > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210644 12838 > main.cpp:199] Git SHA: d3717e5c4d1bf4fca5c41cd7ea54fae489028faa > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210765 12838 > containerizer.cpp:142] Using isolation: posix/cpu,posix/mem,filesystem/posix > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.215638 12838 > linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy > for the Linux launcher > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.220279 12838 > systemd.cpp:128] systemd version `219` detected > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.227017 12838 > systemd.cpp:210] Started systemd slice `mesos_executors.slice` > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: Failed to create a > containerizer: Could not create MesosContainerizer: Failed to create > launcher: Failed to locate systemd cgroups hierarchy: does not exist > Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service: main process > exited, code=exited, status=1/FAILURE > Dec 24 10:35:25 mesos-slave1 systemd[1]: Unit mesos-slave.service entered > failed state. > Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service failed. 
> {code} > I used strace to debug it, mesos-slave tried to access > "/sys/fs/cgroup/systemd/mesos_executors.slice", but it's actually at > "/sys/fs/cgroup/systemd/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope/mesos_executors.slice/", >mesos-slave should check "/proc/self/cgroup" to find those intermediate > directories: > {code} > # cat /proc/self/cgroup > 8:perf_event:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 7:blkio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 6:net_cls,net_prio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 5:freezer:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 4:devices:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 3:cpu,cpuacct:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 2:cpuset:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 1:name=systemd:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
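The reporter's suggested fix — consulting {{/proc/self/cgroup}} to discover the intermediate directories — can be sketched as follows (hypothetical helper names; this is an illustration in Python, not the actual Mesos C++ implementation):

```python
def systemd_slice_path(proc_cgroup_text, slice_name="mesos_executors.slice",
                       cgroup_root="/sys/fs/cgroup"):
    """Resolve where `slice_name` really lives under the systemd hierarchy,
    including any intermediate path (e.g. a docker-<id>.scope) that the
    current process was started inside of."""
    for line in proc_cgroup_text.splitlines():
        # Lines look like "1:name=systemd:/system.slice/docker-<id>.scope".
        _, subsystems, path = line.split(":", 2)
        if "name=systemd" in subsystems:
            parts = [p for p in path.split("/") if p]
            return "/".join([cgroup_root, "systemd"] + parts + [slice_name])
    raise ValueError("no name=systemd hierarchy in /proc/self/cgroup")
```

On bare metal the systemd line is typically `1:name=systemd:/`, so the helper degrades to the `/sys/fs/cgroup/systemd/mesos_executors.slice` path that mesos-slave already probes.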
[jira] [Comment Edited] (MESOS-5504) Implement GET_MAINTENANCE_SCHEDULE Call in v1 master API.
[ https://issues.apache.org/jira/browse/MESOS-5504?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336574#comment-15336574 ] Vinod Kone edited comment on MESOS-5504 at 6/17/16 5:59 PM: commit 87c079e979b8fbf04c4ed491f843f92266f3d7da Author: haosdent huang Date: Fri Jun 17 10:40:45 2016 -0700 Added test case `MasterAPITest.UpdateAndGetMaintenanceSchedule`. Review: https://reviews.apache.org/r/48259/ commit b73f0a50f6c5c2feb642827e3e6fbe0ec1a1c914 Author: haosdent huang Date: Fri Jun 17 10:40:39 2016 -0700 Implemented GET_MAINTENANCE_STATUS Call in v1 master API. Review: https://reviews.apache.org/r/48084/ commit 5f09adb9aa7b49e4d83104f36a14df1385c6880a Author: haosdent huang Date: Fri Jun 17 10:40:33 2016 -0700 Implemented GET_MAINTENANCE_SCHEDULE Call in v1 master API. Review: https://reviews.apache.org/r/48257/ was (Author: vinodkone): commit 87c079e979b8fbf04c4ed491f843f92266f3d7da Author: haosdent huang Date: Fri Jun 17 10:40:45 2016 -0700 Added test case `MasterAPITest.UpdateAndGetMaintenanceSchedule`. Review: https://reviews.apache.org/r/48259/ commit b73f0a50f6c5c2feb642827e3e6fbe0ec1a1c914 Author: haosdent huang Date: Fri Jun 17 10:40:39 2016 -0700 Implemented GET_MAINTENANCE_STATUS Call in v1 master API. Review: https://reviews.apache.org/r/48084/ > Implement GET_MAINTENANCE_SCHEDULE Call in v1 master API. > - > > Key: MESOS-5504 > URL: https://issues.apache.org/jira/browse/MESOS-5504 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: haosdent > Fix For: 1.0.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output
[ https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336558#comment-15336558 ] Mallik Singaraju commented on MESOS-4087: - I am looking at the stdout/stderr of the agent sandbox running the spark executor tasks on mesos. Here is how I am submitting my job from a jenkins slave which has spark-submit on it. SPARK_JAVA_OPTS="\ -Dspark.executor.uri=https://s3.amazonaws.com//spark-1.6.1-bin-hadoop-2.6_scala-2.11.tgz \ -Dlog4j.configuration=log4j.properties \ " \ $SPARK_HOME/bin/spark-submit \ --class com.uptake.ad.AnomalyDetectionApp \ --deploy-mode cluster \ --verbose \ --conf spark.master=mesos://xx.xx.xx.xx:7070 \ --conf spark.ssl.enabled=true \ --conf spark.mesos.coarse=false \ --conf spark.cores.max=1 \ --conf spark.executor.memory=1G \ --conf spark.driver.memory=1G \ https://s3.amazonaws.com//.jar I want to override the log4j config, which defaults to spark_home/conf, with the one from the classpath in the .jar when the spark executor task is being run. The goal is to add a graylog appender to log4j so that I can push the driver's as well as the executors' application-specific logs to a central gray log server. Looks like when an executor task runs on mesos, spark always loads the log4j.properties from SPARK_HOME/conf instead of from the .jar > Introduce a module for logging executor/task output > --- > > Key: MESOS-4087 > URL: https://issues.apache.org/jira/browse/MESOS-4087 > Project: Mesos > Issue Type: Task > Components: containerization, modules >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: logging, mesosphere > Fix For: 0.27.0 > > > Existing executor/task logs are logged to files in their sandbox directory, > with some nuances based on which containerizer is used (see background > section in linked document). > A logger for executor/task logs has the following requirements: > * The logger is given a command to run and must handle the stdout/stderr of > the command. 
> * The handling of stdout/stderr must be resilient across agent failover. > Logging should not stop if the agent fails. > * Logs should be readable, presumably via the web UI, or via some other > module-specific UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5633) User related shell environment is not set correctly in tasks
haosdent created MESOS-5633: --- Summary: User related shell environment is not set correctly in tasks Key: MESOS-5633 URL: https://issues.apache.org/jira/browse/MESOS-5633 Project: Mesos Issue Type: Bug Reporter: haosdent If a user specifies the user field in {{FrameworkInfo}} or {{Task}}, both {{setuid}} and {{setgroups}} are applied correctly. However, some user-related shell variables, e.g., {{HOME}} and {{USER}}, still refer to root. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
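The missing variables can be derived from the password database; a minimal sketch, assuming a POSIX system and Python's {{pwd}} module (the {{login_environment}} helper is hypothetical for illustration — Mesos itself does this in its C++ launcher, not in Python):

```python
import pwd

def login_environment(username):
    """Shell variables that should be reset alongside setuid()/setgroups(),
    instead of being inherited from the root-owned agent process."""
    pw = pwd.getpwnam(username)
    return {
        "HOME": pw.pw_dir,       # otherwise stays e.g. /root
        "USER": pw.pw_name,      # otherwise stays root
        "LOGNAME": pw.pw_name,
        "SHELL": pw.pw_shell,
    }
```

A launcher would merge this dict into the task's environment right after dropping privileges.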
[jira] [Updated] (MESOS-5617) Mesos website preview incorrect in facebook
[ https://issues.apache.org/jira/browse/MESOS-5617?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] haosdent updated MESOS-5617: Attachment: facebook_post.png > Mesos website preview incorrect in facebook > --- > > Key: MESOS-5617 > URL: https://issues.apache.org/jira/browse/MESOS-5617 > Project: Mesos > Issue Type: Improvement > Components: project website >Reporter: haosdent >Assignee: haosdent >Priority: Minor > Attachments: facebook_post.png > > > We need to follow > https://developers.facebook.com/docs/sharing/best-practices#images to prevent > the preview logo of shares related to the Mesos website from being cropped by > Facebook. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4248) mesos slave can't start in CentOS-7 docker container
[ https://issues.apache.org/jira/browse/MESOS-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336464#comment-15336464 ] Joseph Wu commented on MESOS-4248: -- This might be related to what you want: [MESOS-5544]. > mesos slave can't start in CentOS-7 docker container > > > Key: MESOS-4248 > URL: https://issues.apache.org/jira/browse/MESOS-4248 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.26.0 > Environment: My host OS is Debian Jessie, the container OS is CentOS > 7.2. > {code} > # cat /etc/system-release > CentOS Linux release 7.2.1511 (Core) > # rpm -qa |grep mesos > mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64 > mesosphere-el-repo-7-1.noarch > mesos-0.26.0-0.2.145.centos701406.x86_64 > $ docker version > Client: > Version: 1.9.1 > API version: 1.21 > Go version: go1.4.2 > Git commit: a34a1d5 > Built:Fri Nov 20 12:59:02 UTC 2015 > OS/Arch: linux/amd64 > Server: > Version: 1.9.1 > API version: 1.21 > Go version: go1.4.2 > Git commit: a34a1d5 > Built:Fri Nov 20 12:59:02 UTC 2015 > OS/Arch: linux/amd64 > {code} >Reporter: Yubao Liu > > // Check the "Environment" label above for kinds of software versions. > "systemctl start mesos-slave" can't start mesos-slave: > {code} > # journalctl -u mesos-slave > > Dec 24 10:35:25 mesos-slave1 systemd[1]: Started Mesos Slave. > Dec 24 10:35:25 mesos-slave1 systemd[1]: Starting Mesos Slave... > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210180 12838 > logging.cpp:172] INFO level logging started! 
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210603 12838 > main.cpp:190] Build: 2015-12-16 23:06:16 by root > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210625 12838 > main.cpp:192] Version: 0.26.0 > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210634 12838 > main.cpp:195] Git tag: 0.26.0 > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210644 12838 > main.cpp:199] Git SHA: d3717e5c4d1bf4fca5c41cd7ea54fae489028faa > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210765 12838 > containerizer.cpp:142] Using isolation: posix/cpu,posix/mem,filesystem/posix > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.215638 12838 > linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy > for the Linux launcher > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.220279 12838 > systemd.cpp:128] systemd version `219` detected > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.227017 12838 > systemd.cpp:210] Started systemd slice `mesos_executors.slice` > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: Failed to create a > containerizer: Could not create MesosContainerizer: Failed to create > launcher: Failed to locate systemd cgroups hierarchy: does not exist > Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service: main process > exited, code=exited, status=1/FAILURE > Dec 24 10:35:25 mesos-slave1 systemd[1]: Unit mesos-slave.service entered > failed state. > Dec 24 10:35:25 mesos-slave1 systemd[1]: mesos-slave.service failed. 
> {code} > I used strace to debug it, mesos-slave tried to access > "/sys/fs/cgroup/systemd/mesos_executors.slice", but it's actually at > "/sys/fs/cgroup/systemd/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope/mesos_executors.slice/", >mesos-slave should check "/proc/self/cgroup" to find those intermediate > directories: > {code} > # cat /proc/self/cgroup > 8:perf_event:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 7:blkio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 6:net_cls,net_prio:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 5:freezer:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 4:devices:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 3:cpu,cpuacct:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 2:cpuset:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > 1:name=systemd:/system.slice/docker-45875efce9019375cd0c5b29bb1a12275fb6033293f9bf3d97d774a1e5d4de52.scope > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output
[ https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336435#comment-15336435 ] Joseph Wu commented on MESOS-4087: -- Just to clarify, are you looking at the stdout/stderr of your {{spark-submit}} command? Or are you looking at the [agent sandboxes|http://mesos.apache.org/documentation/latest/sandbox/#where-is-it] for your spark executors? Under the default settings, the spark executors' sandboxes will have a {{stdout}} and {{stderr}} file for their stdout/stderr logging. If {{log4j}} places logs in a different location, you'll have to check that location. > Introduce a module for logging executor/task output > --- > > Key: MESOS-4087 > URL: https://issues.apache.org/jira/browse/MESOS-4087 > Project: Mesos > Issue Type: Task > Components: containerization, modules >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: logging, mesosphere > Fix For: 0.27.0 > > > Existing executor/task logs are logged to files in their sandbox directory, > with some nuances based on which containerizer is used (see background > section in linked document). > A logger for executor/task logs has the following requirements: > * The logger is given a command to run and must handle the stdout/stderr of > the command. > * The handling of stdout/stderr must be resilient across agent failover. > Logging should not stop if the agent fails. > * Logs should be readable, presumably via the web UI, or via some other > module-specific UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
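The sandbox location mentioned above follows the directory layout visible in the agent logs earlier in this digest (work dir, then slaves/frameworks/executors/runs). A small sketch of that layout (the helper name is hypothetical; the path scheme is as seen in the logs):

```python
import os.path

def sandbox_path(work_dir, agent_id, framework_id, executor_id, run_id):
    """Build the sandbox directory for one executor run, following the
    layout the agent logs show when it chowns the sandbox."""
    return os.path.join(work_dir, "slaves", agent_id, "frameworks",
                        framework_id, "executors", executor_id,
                        "runs", run_id)
```

Under the default settings, the `stdout` and `stderr` files for an executor live directly inside the directory this returns.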
[jira] [Commented] (MESOS-4248) mesos slave can't start in CentOS-7 docker container
[ https://issues.apache.org/jira/browse/MESOS-4248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336432#comment-15336432 ] Justin Venus commented on MESOS-4248: - Thanks for pointing that ticket out. However, MESOS-4675 doesn't solve my use case. - I want to run systemd in a docker container - I want mesos-slave to set up the slice "mesos_executors.slice" - I want to use the cgroup isolators - I want mesos-executor tasks to survive a mesos-slave restart - Basically I want mesos-slave to work like it's on bare metal (especially in a docker container). I'm carrying around patches for 0.25.0, 0.26.0 and testing 0.27.2 to make this work. I'll open a feature request in jira. Please notice systemd is in a CGroup {code} [root@mesos-slave05of2 /]# systemctl status ● mesos-slave05of2 State: running Jobs: 0 queued Failed: 0 units Since: Wed 2016-06-08 21:41:38 UTC; 1 weeks 1 days ago CGroup: /system.slice/docker-6c53ffcbc602cc6b19149030f6f453a1febd7fc79bf472fa6227c1fecd7c053c.scope ├─1 /usr/lib/systemd/systemd --system --log-target=console --log-level=info --unit=mesos-slave.target ├─mesos_executors.slice │ ├─10139 python2.7 /var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-001235-380151 │ ├─10171 /usr/bin/python2.7 /var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012 │ ├─10282 /usr/bin/python2.7 /var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012 │ ├─10283 /bin/bash -c echo '#!/bin/bash PEX_INSTALL=${PEX_INSTALL:-${HOME}/.pex/install} LD_LIBRARY_PATH=${LD_LIB │ ├─12647 python2.7 /var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-001235-380151 │ ├─12672 /usr/bin/python2.7 /var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012 │ ├─12690 /usr/bin/python2.7 /var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-0012 │ └─12691 python2.7 /var/lib/mesos/slaves/f54196a4-d706-4324-97f6-009e18022152-S8/frameworks/20160221-001235-380151 └─system.slice ├─thermos-observer.service │ └─142 python2.7 /usr/sbin/thermos_observer --mesos-root=/var/lib/mesos --port=1338 --log_to_disk=NONE --log_to_ └─mesos-slave.service ├─ 143 mesos-slave └─ 187 mesos-docker-executor {code} > mesos slave can't start in CentOS-7 docker container > > > Key: MESOS-4248 > URL: https://issues.apache.org/jira/browse/MESOS-4248 > Project: Mesos > Issue Type: Bug > Components: slave >Affects Versions: 0.26.0 > Environment: My host OS is Debian Jessie, the container OS is CentOS > 7.2. > {code} > # cat /etc/system-release > CentOS Linux release 7.2.1511 (Core) > # rpm -qa |grep mesos > mesosphere-zookeeper-3.4.6-0.1.20141204175332.centos7.x86_64 > mesosphere-el-repo-7-1.noarch > mesos-0.26.0-0.2.145.centos701406.x86_64 > $ docker version > Client: > Version: 1.9.1 > API version: 1.21 > Go version: go1.4.2 > Git commit: a34a1d5 > Built:Fri Nov 20 12:59:02 UTC 2015 > OS/Arch: linux/amd64 > Server: > Version: 1.9.1 > API version: 1.21 > Go version: go1.4.2 > Git commit: a34a1d5 > Built:Fri Nov 20 12:59:02 UTC 2015 > OS/Arch: linux/amd64 > {code} >Reporter: Yubao Liu > > // Check the "Environment" label above for kinds of software versions. > "systemctl start mesos-slave" can't start mesos-slave: > {code} > # journalctl -u mesos-slave > > Dec 24 10:35:25 mesos-slave1 systemd[1]: Started Mesos Slave. > Dec 24 10:35:25 mesos-slave1 systemd[1]: Starting Mesos Slave... > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210180 12838 > logging.cpp:172] INFO level logging started! 
> Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210603 12838 > main.cpp:190] Build: 2015-12-16 23:06:16 by root > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210625 12838 > main.cpp:192] Version: 0.26.0 > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210634 12838 > main.cpp:195] Git tag: 0.26.0 > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210644 12838 > main.cpp:199] Git SHA: d3717e5c4d1bf4fca5c41cd7ea54fae489028faa > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.210765 12838 > containerizer.cpp:142] Using isolation: posix/cpu,posix/mem,filesystem/posix > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224 10:35:25.215638 12838 > linux_launcher.cpp:103] Using /sys/fs/cgroup/freezer as the freezer hierarchy > for the Linux launcher > Dec 24 10:35:25 mesos-slave1 mesos-slave[12845]: I1224
[jira] [Commented] (MESOS-5629) Agent segfaults after request to '/files/browse'
[ https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336423#comment-15336423 ] Greg Mann commented on MESOS-5629: -- I just did some testing as well - reliably reproduced the segfault before the fix, and was unable to induce it after the fix. LGTM! > Agent segfaults after request to '/files/browse' > > > Key: MESOS-5629 > URL: https://issues.apache.org/jira/browse/MESOS-5629 > Project: Mesos > Issue Type: Bug > Environment: CentOS 7, Mesos 1.0.0-rc1 with patches >Reporter: Greg Mann >Assignee: Joerg Schad >Priority: Blocker > Labels: authorization, mesosphere, security > Fix For: 1.0.0 > > Attachments: test-browse.py > > > We observed a number of agent segfaults today on an internal testing cluster. > Here is a log excerpt: > {code} > Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 > status_update_manager.cpp:392] Received status update acknowledgement (UUID: > e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task > datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework > 6d4248cd-2832-4152-b5d0-defbf36f6759- > Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 > status_update_manager.cpp:824] Checkpointing ACK for status update > TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task > datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework > 6d4248cd-2832-4152-b5d0-defbf36f6759- > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 > http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356 > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 > (unix time) try "date -d @1466097149" if you are using GNU date *** > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received > by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: *** > Jun 16 17:12:29 
ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33 > process::dispatch<>() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7 > _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_ > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752 > mesos::internal::FilesProcess::authorize() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea > mesos::internal::FilesProcess::browse() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43 > std::_Function_handler<>::_M_invoke() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb > _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ENKUlRKNS4_IbEEE1_clESG_ > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341 > process::ProcessManager::resume() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5 > start_thread > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main > process exited, code=killed, status=11/SEGV > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service > entered failed state. > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed. > Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff > time over, scheduling restart. 
> {code} > In every case, the stack trace indicates one of the {{/files/*}} endpoints; I > observed this a number of times coming from {{browse()}}, and twice from > {{read()}}. > The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied: > [this|https://reviews.apache.org/r/48563/] and > [this|https://reviews.apache.org/r/48566/], which were done to repair a > different [segfault issue|https://issues.apache.org/jira/browse/MESOS-5587] > on the master and agent. > Thanks go to [~bmahler] for digging into this a bit and discovering a > possible cause > [here|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5737-L5745], > where use of {{defer()}} may be necessary to keep execution in the correct > context. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-4087) Introduce a module for logging executor/task output
[ https://issues.apache.org/jira/browse/MESOS-4087?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336124#comment-15336124 ] Mallik Singaraju commented on MESOS-4087: - Hi, we are using spark 1.6.1 deployed and running on mesos and I need some info on how to capture the log of spark executors running on mesos in spark 1.6.1. We are not using the container based approach to deploy spark on mesos. Instead we are currently just deploying the spark job (.jar) through spark-submit. I am not currently able to override the default behavior of spark executors always picking up the log4j.properties in /conf. I tried setting log4j.configuration to the log4j.properties in the classpath of the .jar and supplied that as an argument to spark-submit. That does not seem to capture any logs of the spark executor tasks in mesos. I did figure out that you worked on the logging piece through JIRA. Do you have any recommendation on how to approach this? > Introduce a module for logging executor/task output > --- > > Key: MESOS-4087 > URL: https://issues.apache.org/jira/browse/MESOS-4087 > Project: Mesos > Issue Type: Task > Components: containerization, modules >Reporter: Joseph Wu >Assignee: Joseph Wu > Labels: logging, mesosphere > Fix For: 0.27.0 > > > Existing executor/task logs are logged to files in their sandbox directory, > with some nuances based on which containerizer is used (see background > section in linked document). > A logger for executor/task logs has the following requirements: > * The logger is given a command to run and must handle the stdout/stderr of > the command. > * The handling of stdout/stderr must be resilient across agent failover. > Logging should not stop if the agent fails. > * Logs should be readable, presumably via the web UI, or via some other > module-specific UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5629) Agent segfaults after request to '/files/browse'
[ https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15336016#comment-15336016 ] Joerg Schad commented on MESOS-5629: https://reviews.apache.org/r/48849/ > Agent segfaults after request to '/files/browse' > > > Key: MESOS-5629 > URL: https://issues.apache.org/jira/browse/MESOS-5629 > Project: Mesos > Issue Type: Bug > Environment: CentOS 7, Mesos 1.0.0-rc1 with patches >Reporter: Greg Mann >Assignee: Joerg Schad >Priority: Blocker > Labels: authorization, mesosphere, security > Fix For: 1.0.0 > > Attachments: test-browse.py > > > We observed a number of agent segfaults today on an internal testing cluster. > Here is a log excerpt: > {code} > Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 > status_update_manager.cpp:392] Received status update acknowledgement (UUID: > e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task > datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework > 6d4248cd-2832-4152-b5d0-defbf36f6759- > Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 > status_update_manager.cpp:824] Checkpointing ACK for status update > TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task > datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework > 6d4248cd-2832-4152-b5d0-defbf36f6759- > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 > http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356 > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 > (unix time) try "date -d @1466097149" if you are using GNU date *** > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received > by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: *** > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 
mesos-slave[24818]: @ 0x7ff4d68b12a3 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33 > process::dispatch<>() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7 > _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_ > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752 > mesos::internal::FilesProcess::authorize() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea > mesos::internal::FilesProcess::browse() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43 > std::_Function_handler<>::_M_invoke() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb > _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ENKUlRKNS4_IbEEE1_clESG_ > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341 > process::ProcessManager::resume() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5 > start_thread > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main > process exited, code=killed, status=11/SEGV > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service > entered failed state. > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed. > Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff > time over, scheduling restart. 
> {code} > In every case, the stack trace indicates one of the {{/files/*}} endpoints; I > observed this a number of times coming from {{browse()}}, and twice from > {{read()}}. > The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied: > [this|https://reviews.apache.org/r/48563/] and > [this|https://reviews.apache.org/r/48566/], which were done to repair a > different [segfault issue|https://issues.apache.org/jira/browse/MESOS-5587] > on the master and agent. > Thanks go to [~bmahler] for digging into this a bit and discovering a > possible cause > [here|https://github.com/apache/mesos/blob/master/src/slave/slave.cpp#L5737-L5745], > where use of {{defer()}} may be necessary to keep execution in the correct > context. -- This message was sent by Atlassian JIRA
[jira] [Commented] (MESOS-5629) Agent segfaults after request to '/files/browse'
[ https://issues.apache.org/jira/browse/MESOS-5629?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335967#comment-15335967 ] Joerg Schad commented on MESOS-5629: The most likely hypothesis is that the issue is capturing `this` in Framework::launchExecutor and Framework::recoverExecutor. The `Framework` goes out of scope, but the `this` pointer is still kept in the lambda and hence dangles. Our solution is to remove the `this` capture and replace it with a value copy of the slave pid (the only attribute used from the captured `this`). We have not been able to reproduce this on an AWS instance. [~greggomann], could you help out with verifying the patch (see next comment)? > Agent segfaults after request to '/files/browse' > > > Key: MESOS-5629 > URL: https://issues.apache.org/jira/browse/MESOS-5629 > Project: Mesos > Issue Type: Bug > Environment: CentOS 7, Mesos 1.0.0-rc1 with patches >Reporter: Greg Mann >Assignee: Joerg Schad >Priority: Blocker > Labels: authorization, mesosphere, security > Fix For: 1.0.0 > > Attachments: test-browse.py > > > We observed a number of agent segfaults today on an internal testing cluster. 
> Here is a log excerpt: > {code} > Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.522925 24830 > status_update_manager.cpp:392] Received status update acknowledgement (UUID: > e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task > datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework > 6d4248cd-2832-4152-b5d0-defbf36f6759- > Jun 16 17:12:28 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:28.523006 24830 > status_update_manager.cpp:824] Checkpointing ACK for status update > TASK_RUNNING (UUID: e79ab0f4-2fa2-4df2-9b59-89b97a482167) for task > datadog-monitor.804b138b-33e5-11e6-ac16-566ccbdde23e of framework > 6d4248cd-2832-4152-b5d0-defbf36f6759- > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: I0616 17:12:29.147181 24824 > http.cpp:192] HTTP GET for /slave(1)/state from 10.10.0.87:33356 > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** Aborted at 1466097149 > (unix time) try "date -d @1466097149" if you are using GNU date *** > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: PC: @ 0x7ff4d68b12a3 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: *** SIGSEGV (@0x0) received > by PID 24818 (TID 0x7ff4d31ab700) from PID 0; stack trace: *** > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6431100 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d68b12a3 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7eced33 > process::dispatch<>() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7e7aad7 > _ZNSt17_Function_handlerIFN7process6FutureIbEERK6OptionISsEEZN5mesos8internal5slave9Framework15recoverExecutorERKNSA_5state13ExecutorStateEEUlS6_E_E9_M_invokeERKSt9_Any_dataS6_ > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1752 > mesos::internal::FilesProcess::authorize() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd1bea > mesos::internal::FilesProcess::browse() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d7bd6e43 > 
std::_Function_handler<>::_M_invoke() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d85478cb > _ZZZN7process11ProcessBase5visitERKNS_9HttpEventEENKUlRKNS_6FutureI6OptionINS_4http14authentication20AuthenticationResultE0_clESC_ENKUlRKNS4_IbEEE1_clESG_ > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551341 > process::ProcessManager::resume() > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d8551647 > _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6909220 > (unknown) > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d6429dc5 > start_thread > Jun 16 17:12:29 ip-10-10-0-87 mesos-slave[24818]: @ 0x7ff4d615728d __clone > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service: main > process exited, code=killed, status=11/SEGV > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: Unit dcos-mesos-slave.service > entered failed state. > Jun 16 17:12:29 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service failed. > Jun 16 17:12:34 ip-10-10-0-87 systemd[1]: dcos-mesos-slave.service holdoff > time over, scheduling restart. > {code} > In every case, the stack trace indicates one of the {{/files/*}} endpoints; I > observed this a number of times coming from {{browse()}}, and twice from > {{read()}}. > The agent was built from the 1.0.0-rc1 branch, with two cherry-picks applied: > [this|https://reviews.apache.org/r/48563/] and >
[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335936#comment-15335936 ] Till Toenshoff commented on MESOS-5588: --- The object count comparison appears like a great start - I like it. > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad > Labels: mesosphere, security > Fix For: 1.0.0 > > > During parsing of the authorizer errors are ignored. This can lead to > undetected security issues. > Consider the following acl with an typo (usr instead of user) > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags it will interprete the acl int he > following way which gives any principal access to any framework. > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Till Toenshoff updated MESOS-5588: -- Priority: Major (was: Blocker) > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad > Labels: mesosphere, security > Fix For: 1.0.0 > > > During parsing of the authorizer errors are ignored. This can lead to > undetected security issues. > Consider the following acl with an typo (usr instead of user) > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags it will interprete the acl int he > following way which gives any principal access to any framework. > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335934#comment-15335934 ] Till Toenshoff commented on MESOS-5588: --- This patch de-escalates the issue from the original blocker, as further changes will not change the API (i.e. the proto). > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Blocker > Labels: mesosphere, security > Fix For: 1.0.0 > > > During parsing of the authorizer errors are ignored. This can lead to > undetected security issues. > Consider the following acl with an typo (usr instead of user) > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags it will interprete the acl int he > following way which gives any principal access to any framework. > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335933#comment-15335933 ] Till Toenshoff commented on MESOS-5588: --- {noformat} commit a1a9108338b37f2aea0a575dfc7cbca5b8489cc1 Author: Alexander RojasDate: Fri Jun 17 13:02:38 2016 +0200 Marked some optional fields in acls.proto as required. The messages `GetEndpoints`, `ViewFramework`, `ViewTask`, `ViewExecutor` and `AccessSandbox` all have optional authorization objects as a result of copy and pasting previous message, but their semantics were those of an required field, which led to some unexpected behavior when a user misstyped any entry there. This patch sets the fields to their actual expected values. Review: https://reviews.apache.org/r/48781/ {noformat} > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Blocker > Labels: mesosphere, security > Fix For: 1.0.0 > > > During parsing of the authorizer errors are ignored. This can lead to > undetected security issues. > Consider the following acl with an typo (usr instead of user) > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags it will interprete the acl int he > following way which gives any principal access to any framework. > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5632) Orphaned docker container not killed if executor has exited
[ https://issues.apache.org/jira/browse/MESOS-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335734#comment-15335734 ] Mansheng Yang commented on MESOS-5632: -- yes - restarting the agent will kill the two containers and start a new one > Orphaned docker container not killed if executor has exited > --- > > Key: MESOS-5632 > URL: https://issues.apache.org/jira/browse/MESOS-5632 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Reporter: Mansheng Yang > > [This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as > resolved but it was only partially fixed. > As mentioned in that ticket, if you start a docker container, kill the > docker-executor process, then a new container will be started but the old one > will still be there. > Some logs: > {noformat} > I0617 15:01:22.851604 7285 docker.cpp:877] Recovering container > '71695f70-afad-421d-8636-deb6724ecaca' for executor > 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > '317ab6ce-d599-4ad4-bae2-eb74a6c42d87-' > I0617 15:01:22.853303 7285 docker.cpp:2107] Executor for container > '71695f70-afad-421d-8636-deb6724ecaca' has exited > I0617 15:01:22.853327 7285 docker.cpp:1826] Destroying container > '71695f70-afad-421d-8636-deb6724ecaca' > I0617 15:01:22.853575 7285 docker.cpp:1954] Running docker stop on container > '71695f70-afad-421d-8636-deb6724ecaca' > I0617 15:01:22.853607 7285 docker.cpp:1956] Running docker stop on container > 'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0 > I0617 15:01:22.854801 7283 slave.cpp:4767] Sending reconnect request to > executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304 > E0617 15:01:22.855870 7283 process.cpp:2040] Failed to shutdown socket with > fd 10: Transport endpoint is not connected > E0617 15:01:22.855974 7283 slave.cpp:4118] Termination of executor > 
'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- failed: Unknown container: > 71695f70-afad-421d-8636-deb6724ecaca > I0617 15:01:22.857015 7283 slave.cpp:3257] Handling status update > TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- from @0.0.0.0:0 > W0617 15:01:22.858330 7288 docker.cpp:1403] Ignoring updating unknown > container: 71695f70-afad-421d-8636-deb6724ecaca > I0617 15:01:22.858819 7288 status_update_manager.cpp:320] Received status > update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- > I0617 15:01:22.858986 7288 status_update_manager.cpp:824] Checkpointing > UPDATE for status update TASK_FAILED (UUID: > b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- > W0617 15:01:22.920336 7289 slave.cpp:3601] Dropping status update > TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- sent by status update manager > because the agent is in RECOVERING state > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5632) Orphaned docker container not killed if executor has exited
[ https://issues.apache.org/jira/browse/MESOS-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335650#comment-15335650 ] haosdent commented on MESOS-5632: - Does restarting the Mesos agent work for you? > Orphaned docker container not killed if executor has exited > --- > > Key: MESOS-5632 > URL: https://issues.apache.org/jira/browse/MESOS-5632 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Reporter: Mansheng Yang > > [This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as > resolved but it was only partially fixed. > As mentioned in that ticket, if you start a docker container, kill the > docker-executor process, then a new container will be started but the old one > will still be there. > Some logs: > {noformat} > I0617 15:01:22.851604 7285 docker.cpp:877] Recovering container > '71695f70-afad-421d-8636-deb6724ecaca' for executor > 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > '317ab6ce-d599-4ad4-bae2-eb74a6c42d87-' > I0617 15:01:22.853303 7285 docker.cpp:2107] Executor for container > '71695f70-afad-421d-8636-deb6724ecaca' has exited > I0617 15:01:22.853327 7285 docker.cpp:1826] Destroying container > '71695f70-afad-421d-8636-deb6724ecaca' > I0617 15:01:22.853575 7285 docker.cpp:1954] Running docker stop on container > '71695f70-afad-421d-8636-deb6724ecaca' > I0617 15:01:22.853607 7285 docker.cpp:1956] Running docker stop on container > 'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0 > I0617 15:01:22.854801 7283 slave.cpp:4767] Sending reconnect request to > executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304 > E0617 15:01:22.855870 7283 process.cpp:2040] Failed to shutdown socket with > fd 10: Transport endpoint is not connected > E0617 15:01:22.855974 7283 slave.cpp:4118] Termination of executor > 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > 
317ab6ce-d599-4ad4-bae2-eb74a6c42d87- failed: Unknown container: > 71695f70-afad-421d-8636-deb6724ecaca > I0617 15:01:22.857015 7283 slave.cpp:3257] Handling status update > TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- from @0.0.0.0:0 > W0617 15:01:22.858330 7288 docker.cpp:1403] Ignoring updating unknown > container: 71695f70-afad-421d-8636-deb6724ecaca > I0617 15:01:22.858819 7288 status_update_manager.cpp:320] Received status > update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- > I0617 15:01:22.858986 7288 status_update_manager.cpp:824] Checkpointing > UPDATE for status update TASK_FAILED (UUID: > b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- > W0617 15:01:22.920336 7289 slave.cpp:3601] Dropping status update > TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task > kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- sent by status update manager > because the agent is in RECOVERING state > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5632) Orphaned docker container not killed if executor has exited
[ https://issues.apache.org/jira/browse/MESOS-5632?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mansheng Yang updated MESOS-5632: - Description: [This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as resolved but it was only partially fixed. As mentioned in that ticket, if you start a docker container, kill the docker-executor process, then a new container will be started but the old one will still be there. Some logs: {noformat} I0617 15:01:22.851604 7285 docker.cpp:877] Recovering container '71695f70-afad-421d-8636-deb6724ecaca' for executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework '317ab6ce-d599-4ad4-bae2-eb74a6c42d87-' I0617 15:01:22.853303 7285 docker.cpp:2107] Executor for container '71695f70-afad-421d-8636-deb6724ecaca' has exited I0617 15:01:22.853327 7285 docker.cpp:1826] Destroying container '71695f70-afad-421d-8636-deb6724ecaca' I0617 15:01:22.853575 7285 docker.cpp:1954] Running docker stop on container '71695f70-afad-421d-8636-deb6724ecaca' I0617 15:01:22.853607 7285 docker.cpp:1956] Running docker stop on container 'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0 I0617 15:01:22.854801 7283 slave.cpp:4767] Sending reconnect request to executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304 E0617 15:01:22.855870 7283 process.cpp:2040] Failed to shutdown socket with fd 10: Transport endpoint is not connected E0617 15:01:22.855974 7283 slave.cpp:4118] Termination of executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- failed: Unknown container: 71695f70-afad-421d-8636-deb6724ecaca I0617 15:01:22.857015 7283 slave.cpp:3257] Handling status update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- from @0.0.0.0:0 W0617 
15:01:22.858330 7288 docker.cpp:1403] Ignoring updating unknown container: 71695f70-afad-421d-8636-deb6724ecaca I0617 15:01:22.858819 7288 status_update_manager.cpp:320] Received status update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- I0617 15:01:22.858986 7288 status_update_manager.cpp:824] Checkpointing UPDATE for status update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- W0617 15:01:22.920336 7289 slave.cpp:3601] Dropping status update TASK_FAILED (UUID: b5dfa1dc-62db-4fb5-93c8-958d22f930df) for task kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d of framework 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- sent by status update manager because the agent is in RECOVERING state {noformat} > Orphaned docker container not killed if executor has exited > --- > > Key: MESOS-5632 > URL: https://issues.apache.org/jira/browse/MESOS-5632 > Project: Mesos > Issue Type: Bug > Components: docker, slave >Reporter: Mansheng Yang > > [This ticket|https://issues.apache.org/jira/browse/MESOS-3573] is marked as > resolved but it was only partially fixed. > As mentioned in that ticket, if you start a docker container, kill the > docker-executor process, then a new container will be started but the old one > will still be there. 
> Some logs: > {noformat} > I0617 15:01:22.851604 7285 docker.cpp:877] Recovering container > '71695f70-afad-421d-8636-deb6724ecaca' for executor > 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > '317ab6ce-d599-4ad4-bae2-eb74a6c42d87-' > I0617 15:01:22.853303 7285 docker.cpp:2107] Executor for container > '71695f70-afad-421d-8636-deb6724ecaca' has exited > I0617 15:01:22.853327 7285 docker.cpp:1826] Destroying container > '71695f70-afad-421d-8636-deb6724ecaca' > I0617 15:01:22.853575 7285 docker.cpp:1954] Running docker stop on container > '71695f70-afad-421d-8636-deb6724ecaca' > I0617 15:01:22.853607 7285 docker.cpp:1956] Running docker stop on container > 'mesos-cbb3d52c-b6dd-4b7e-864d-705fc2fab983-S4.71695f70-afad-421d-8636-deb6724ecaca'0 > I0617 15:01:22.854801 7283 slave.cpp:4767] Sending reconnect request to > executor 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework > 317ab6ce-d599-4ad4-bae2-eb74a6c42d87- at executor(1)@127.0.1.1:56304 > E0617 15:01:22.855870 7283 process.cpp:2040] Failed to shutdown socket with > fd 10: Transport endpoint is not connected > E0617 15:01:22.855974 7283 slave.cpp:4118] Termination of executor > 'kafka2.3802f3c9-3459-11e6-bf06-6e0c5199624d' of framework >
[jira] [Created] (MESOS-5632) Orphaned docker container not killed if executor has exited
Mansheng Yang created MESOS-5632: Summary: Orphaned docker container not killed if executor has exited Key: MESOS-5632 URL: https://issues.apache.org/jira/browse/MESOS-5632 Project: Mesos Issue Type: Bug Components: docker, slave Reporter: Mansheng Yang -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-5631) Implement clang-tidy check for incorrect use of capturing lambdas with Futures
[ https://issues.apache.org/jira/browse/MESOS-5631?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Bannier updated MESOS-5631: Description: When one enqueues capturing lambdas to a {{Future}} with {{then}} or the {{onXXX}} variations, in general any actor might execute that callback (no constraints imposed per se). This can lead to hard to understand dependencies or bugs if the lambda needs to access external state (i.e. anything it captures by references/pointer to instead of by value); instead such callbacks should always be constraint to a specific actor with {{dispatch}}/{{defer}} to ensure the pointed to data isn't modified in a concurrent thread. was: When one enqueues capturing lambdas to a {{Future}} with {{then}} or then {{onXXX}} variations, in general any actor might execute that callback (no constraints imposed per se). This can lead to hard to understand dependencies or bugs if the lambda needs to access external state (i.e. anything it captures by references/pointer to instead of by value); instead such callbacks should always be constraint to a specific actor with {{dispatch}}/{{defer}} to ensure the pointed to data isn't modified in a concurrent thread. > Implement clang-tidy check for incorrect use of capturing lambdas with Futures > -- > > Key: MESOS-5631 > URL: https://issues.apache.org/jira/browse/MESOS-5631 > Project: Mesos > Issue Type: Improvement >Reporter: Benjamin Bannier > > When one enqueues capturing lambdas to a {{Future}} with {{then}} or the > {{onXXX}} variations, in general any actor might execute that callback (no > constraints imposed per se). > This can lead to hard to understand dependencies or bugs if the lambda needs > to access external state (i.e. anything it captures by references/pointer to > instead of by value); instead such callbacks should always be constraint to a > specific actor with {{dispatch}}/{{defer}} to ensure the pointed to data > isn't modified in a concurrent thread. 
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5631) Implement clang-tidy check for incorrect use of capturing lambdas with Futures
Benjamin Bannier created MESOS-5631: --- Summary: Implement clang-tidy check for incorrect use of capturing lambdas with Futures Key: MESOS-5631 URL: https://issues.apache.org/jira/browse/MESOS-5631 Project: Mesos Issue Type: Improvement Reporter: Benjamin Bannier When one enqueues capturing lambdas to a {{Future}} with {{then}} or then {{onXXX}} variations, in general any actor might execute that callback (no constraints imposed per se). This can lead to hard to understand dependencies or bugs if the lambda needs to access external state (i.e. anything it captures by references/pointer to instead of by value); instead such callbacks should always be constraint to a specific actor with {{dispatch}}/{{defer}} to ensure the pointed to data isn't modified in a concurrent thread. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-4967) Oversubscription for reservation
[ https://issues.apache.org/jira/browse/MESOS-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Klaus Ma reassigned MESOS-4967: --- Assignee: Klaus Ma > Oversubscription for reservation > > > Key: MESOS-4967 > URL: https://issues.apache.org/jira/browse/MESOS-4967 > Project: Mesos > Issue Type: Epic > Components: allocation, framework, master >Reporter: Klaus Ma >Assignee: Klaus Ma > Labels: IBM, mesosphere > > Reserved resources allow frameworks and cluster operators to ensure > sufficient resources are available when needed. Reservations are usually > made to guarantee there are enough resources under peak loads. Often times, > reserved resources are not actually allocated; in other words, the frameworks > do not use those resources and they sit reserved, but idle. > This underutilization is either an opportunity cost or a direct cost, > particularly to the cluster operator. Reserved but unallocated resources > held by a Lender Framework could be optimistically offered to other > frameworks, which we refer to as Tenant Frameworks. When the resources are > requested back by the Lender Framework, some of the Tenant Framework's tasks > are evicted and the original resource offer guarantee is preserved. > The first step is to identify when resources are reserved, but not allocated. > We then offer these reserved resources to other frameworks, but mark these > offered resources as revocable resources. This allows Tenant Frameworks to > use these resources temporarily in a 'best-effort' fashion, knowing that they > could be revoked or reclaimed at any time. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5491) Implement GET_AGENTS Call in v1 master API.
[ https://issues.apache.org/jira/browse/MESOS-5491?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335541#comment-15335541 ] zhou xing commented on MESOS-5491: -- two RRs: https://reviews.apache.org/r/48841/ & https://reviews.apache.org/r/48438/ > Implement GET_AGENTS Call in v1 master API. > --- > > Key: MESOS-5491 > URL: https://issues.apache.org/jira/browse/MESOS-5491 > Project: Mesos > Issue Type: Task >Reporter: Vinod Kone >Assignee: zhou xing > -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335528#comment-15335528 ] Joerg Schad edited comment on MESOS-5588 at 6/17/16 6:37 AM: - 1) Yes, that happens with every protobuf conversion (see my earlier comments on this ticket about changing the parsing), but in this case it yields a security-critical issue. IMO this is within the scope of this ticket (improve error handling when parsing acls), but if you want to split the 1.0-blocker-relevant part into an extra ticket, that seems fine with me. I agree that the second part is not a blocker (as it does not involve an API change), but I would not say that it is a low-priority wish was (Author: js84): I agree that the second part is not a blocker (as it does not involve an API change), but I would not say that it is a low priority wish > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Blocker > Labels: mesosphere, security > Fix For: 1.0.0 > > > During parsing of the authorizer errors are ignored. This can lead to > undetected security issues. > Consider the following acl with an typo (usr instead of user) > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags it will interprete the acl int he > following way which gives any principal access to any framework. > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335528#comment-15335528 ] Joerg Schad commented on MESOS-5588: I agree that the second part is not a blocker (as it does not involve an API change), but I would not say that it is a low priority wish > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Blocker > Labels: mesosphere, security > Fix For: 1.0.0 > > > During parsing of the authorizer errors are ignored. This can lead to > undetected security issues. > Consider the following acl with an typo (usr instead of user) > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags it will interprete the acl int he > following way which gives any principal access to any framework. > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-5630) Change build to always enable Nvidia GPU support for Linux
Kevin Klues created MESOS-5630: -- Summary: Change build to always enable Nvidia GPU support for Linux Key: MESOS-5630 URL: https://issues.apache.org/jira/browse/MESOS-5630 Project: Mesos Issue Type: Improvement Environment: Build / run unit tests in three build environments: {noformat} 1) CentOS 7 on GPU capable machine 2) CentOS 7 on NON-GPU capable machine 3) OSX $ rm -rf build; ./bootstrap; mkdir build; cd build; ../configure; make -j check; sudo GTEST_FILTER="*NVIDIA*" src/mesos-tests {noformat} Test support/build_docker.sh (to make sure we won't crash Apache's CI): {noformat} $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=centos:7 support/docker_build.sh $ ENVIRONMENT='GLOG_v=1 MESOS_VERBOSE=1' CONFIGURATION="--enable-libevent --enable-ssl" COMPILER=gcc BUILDTOOL=autotools OS=ubuntu:14.04 support/docker_build.sh {noformat} Reporter: Kevin Klues Assignee: Kevin Klues Fix For: 1.0.0 See Summary -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-5588) Improve error handling when parsing acls.
[ https://issues.apache.org/jira/browse/MESOS-5588?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15335506#comment-15335506 ] Alexander Rojas commented on MESOS-5588: # What you describe is not an ACLs problem but it affects every protobuf/json conversion in Mesos, so probably we should open another Jira entry for that. # I do not think the behavior you describe is a blocker, since it doesn't represent a regression nor a change in the API, the patch provided deals with the blocked part. But what you suggest sounds more like a low priority whish. > Improve error handling when parsing acls. > - > > Key: MESOS-5588 > URL: https://issues.apache.org/jira/browse/MESOS-5588 > Project: Mesos > Issue Type: Improvement >Reporter: Joerg Schad >Assignee: Joerg Schad >Priority: Blocker > Labels: mesosphere, security > Fix For: 1.0.0 > > > During parsing of the authorizer errors are ignored. This can lead to > undetected security issues. > Consider the following acl with an typo (usr instead of user) > {code} >"view_frameworks": [ > { > "principals": { "type": "ANY" }, > "usr": { "type": "NONE" } > } > ] > {code} > When the master is started with these flags it will interprete the acl int he > following way which gives any principal access to any framework. > {noformat} > view_frameworks { > principals { > type: ANY > } > } > {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)