[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Affects Version/s: (was: 1.0.0) 1.1.0 > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.1.0 >Reporter: Yan Xu > > The issue described in MESOS-6446 is still not fixed in 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Priority: Major (was: Blocker) > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.1.0 >Reporter: Yan Xu > > The issue described in MESOS-6446 is still not fixed in 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Shepherd: (was: Vinod Kone) > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Priority: Blocker > > The issue described in MESOS-6446 is still not fixed in 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Target Version/s: (was: 1.0.2, 1.1.0) > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Priority: Blocker > > The issue described in MESOS-6446 is still not fixed in 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Assignee: (was: haosdent) > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Priority: Blocker > > The issue described in MESOS-6446 is still not fixed in 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Fix Version/s: (was: 1.2.0) (was: 1.0.2) (was: 1.1.0) > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Priority: Blocker > > The issue described in MESOS-6446 is still not fixed in 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Description: The issue described in MESOS-6446 is still not fixed in 1.1.0. (was: The issue ) > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.0.2, 1.1.0, 1.2.0 > > > The issue described in MESOS-6446 is still not fixed in 1.1.0. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yan Xu updated MESOS-6988: -- Description: The issue (was: After Mesos 1.0, the webUI redirect is hidden from the users so you can go to any of the master and the webUI is populated with state.json from the leading master. This doesn't include stats from /metric/snapshot though as it is not redirected. The user ends up seeing some fields with empty values.) > CLONE - WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6988 > URL: https://issues.apache.org/jira/browse/MESOS-6988 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.0.2, 1.1.0, 1.2.0 > > > The issue -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6988) CLONE - WebUI redirect doesn't work with stats from /metric/snapshot
Yan Xu created MESOS-6988: - Summary: CLONE - WebUI redirect doesn't work with stats from /metric/snapshot Key: MESOS-6988 URL: https://issues.apache.org/jira/browse/MESOS-6988 Project: Mesos Issue Type: Bug Components: webui Affects Versions: 1.0.0 Reporter: Yan Xu Assignee: haosdent Priority: Blocker Fix For: 1.0.2, 1.1.0, 1.2.0 After Mesos 1.0, the webUI redirect is hidden from the users, so you can go to any of the masters and the webUI is populated with state.json from the leading master. This doesn't include stats from /metric/snapshot though, as it is not redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6953) A compromised mesos-master node can execute code as root on agents.
[ https://issues.apache.org/jira/browse/MESOS-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15837018#comment-15837018 ] Anindya Sinha commented on MESOS-6953: -- In a normal case (when master is not compromised), we should always have the same acls for {{run_tasks}} on each agent of the cluster, so the framework should be sure that the tasks would launch on any agent if it passes authorization on the master. In the case of a compromised master, we do not want the agent to launch tasks as a privileged user. The check against the {{run_tasks}} acl on the agent is just for that purpose. Regarding the live upgrade case: If this functionality is desired (i.e. to protect against running tasks on the agent as privileged users through a compromised master), we need to add the {{run_tasks}} acl (not all acls) on each agent that matches with the {{run_tasks}} acl on the master. Another option instead of using framework principal as the "subject" could be to add another flag for mesos-slave that enlists the {{whitelisted_users}} (instead of using {{acls}}) which the agent checks to ensure that the task user for the task that is going to be launched is included in that list of whitelisted users. The reason of using {{acls}} on the agent is mainly to reuse existing authorization module. > A compromised mesos-master node can execute code as root on agents. > --- > > Key: MESOS-6953 > URL: https://issues.apache.org/jira/browse/MESOS-6953 > Project: Mesos > Issue Type: Bug > Components: security >Reporter: Anindya Sinha >Assignee: Anindya Sinha > Labels: security, slave > > mesos-master has a `--[no-]root_submissions` flag that controls whether > frameworks with `root` user are admitted to the cluster. > However, if a mesos-master node is compromised, it can attempt to schedule > tasks on agent as the `root` user. 
Since mesos-agent has no check against > tasks running on the agent for specific users, tasks can end up running with > `root` privileges within the container on the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
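The whitelist alternative Anindya floats above (an agent-side list of allowed task users, checked before launch) can be sketched in a few lines. This is illustrative C++ only, under assumed names ({{isTaskUserAllowed}} and the hypothetical {{whitelisted_users}} flag contents); it is not the Mesos authorizer API:

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical sketch: the agent is started with a list of users that tasks
// may run as, and rejects launches for any other user. Names here are
// illustrative, not Mesos APIs.
bool isTaskUserAllowed(const std::set<std::string>& whitelistedUsers,
                       const std::string& taskUser) {
  // An empty whitelist is treated as "no restriction" here, mirroring how
  // optional agent flags typically default; a hardened deployment might
  // instead deny by default.
  if (whitelistedUsers.empty()) {
    return true;
  }
  return whitelistedUsers.count(taskUser) > 0;
}
```

With such a check, a compromised master asking for a `root` task would be refused on any agent whose whitelist omits `root`, regardless of what ACLs the master claims to have evaluated.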
[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836998#comment-15836998 ] Yan Xu commented on MESOS-6446: --- You should be able to read the metrics endpoint directly; this ticket is about having the webUI read from the leading master's metrics. Alright, I'll open a new one. > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.0.2, 1.1.0, 1.2.0 > > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836985#comment-15836985 ] Adam B commented on MESOS-6446: --- And please open a new (cloned even) ticket for the non-leading masters, since we've already committed some fixes to 3 different releases, and set the fixVersions accordingly. It'll be easier to track the fixVersions for the new issue/fix/backports. > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.0.2, 1.1.0, 1.2.0 > > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6446) WebUI redirect doesn't work with stats from /metric/snapshot
[ https://issues.apache.org/jira/browse/MESOS-6446?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836981#comment-15836981 ] Adam B commented on MESOS-6446: --- But might you need to read the metrics for the non-leading masters themselves, instead of always getting the metrics from the leading master? I'm not sure we always want to redirect for metrics.. > WebUI redirect doesn't work with stats from /metric/snapshot > > > Key: MESOS-6446 > URL: https://issues.apache.org/jira/browse/MESOS-6446 > Project: Mesos > Issue Type: Bug > Components: webui >Affects Versions: 1.0.0 >Reporter: Yan Xu >Assignee: haosdent >Priority: Blocker > Fix For: 1.0.2, 1.1.0, 1.2.0 > > Attachments: Screen Shot 2016-10-21 at 12.04.23 PM.png, > webui_metrics.gif > > > After Mesos 1.0, the webUI redirect is hidden from the users so you can go to > any of the master and the webUI is populated with state.json from the leading > master. > This doesn't include stats from /metric/snapshot though as it is not > redirected. The user ends up seeing some fields with empty values. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-5116) Investigate supporting accounting only mode in XFS isolator
[ https://issues.apache.org/jira/browse/MESOS-5116?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836891#comment-15836891 ] James Peach edited comment on MESOS-5116 at 1/25/17 1:10 AM: - | Stop storing agent flags in the XFS disk isolator. | https://reviews.apache.org/r/55896/ | | Add support for not enforcing XFS quotas. | https://reviews.apache.org/r/55897/ | | Update XFS disk isolator documentation. | https://reviews.apache.org/r/55903/ | was (Author: jamespeach): | Stop storing agent flags in the XFS disk isolator. | https://reviews.apache.org/r/55896/ | | Add support for not enforcing XFS quotas. |https://reviews.apache.org/r/55897/ | > Investigate supporting accounting only mode in XFS isolator > --- > > Key: MESOS-5116 > URL: https://issues.apache.org/jira/browse/MESOS-5116 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Yan Xu >Assignee: James Peach > > The initial implementation of XFS isolator always enforces the disk quota > limit. In contrast, Posix disk isolator supports optionally monitoring the > disk usage without enforcement. This eases the transition into disk quota > enforcement mode. > Mesos agent provides a {{flags.enforce_container_disk_quota}} flag to turn on > enforcement when the Posix isolator is added. With XFS either we support it > as well or we need to change the flag so it's Posix disk isolator specific. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6953) A compromised mesos-master node can execute code as root on agents.
[ https://issues.apache.org/jira/browse/MESOS-6953?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836937#comment-15836937 ] Adam B commented on MESOS-6953: --- cc: [~arojas] Interesting.. So you use the framework principal as the "subject", although it's the master that's actually making the request? So, now, if a framework wants to run a task, it must have permission not just on the masters, but also on every agent (where it might want to run)? What if it has the ACL on some agents, but not others? How would it discover that, by trial and error? What's the live upgrade story here? Operators must copy the run_tasks ACL from the masters to all agents (and restart the agents)? > A compromised mesos-master node can execute code as root on agents. > --- > > Key: MESOS-6953 > URL: https://issues.apache.org/jira/browse/MESOS-6953 > Project: Mesos > Issue Type: Bug > Components: security >Reporter: Anindya Sinha >Assignee: Anindya Sinha > Labels: security, slave > > mesos-master has a `--[no-]root_submissions` flag that controls whether > frameworks with `root` user are admitted to the cluster. > However, if a mesos-master node is compromised, it can attempt to schedule > tasks on agent as the `root` user. Since mesos-agent has no check against > tasks running on the agent for specific users, tasks can get run with `root` > privileges can get run within the container on the agent. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-6375) Support hierarchical resource allocation roles.
[ https://issues.apache.org/jira/browse/MESOS-6375?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836740#comment-15836740 ] Neil Conway commented on MESOS-6375: Design doc: https://docs.google.com/document/d/1Ie2-6O400ayNXtRqipHq6_CCQ4wOoLWzoqql3b0Y6HU/edit# > Support hierarchical resource allocation roles. > --- > > Key: MESOS-6375 > URL: https://issues.apache.org/jira/browse/MESOS-6375 > Project: Mesos > Issue Type: Epic > Components: allocation >Reporter: Benjamin Mahler > > Currently mesos provides a non-hierarchical resource allocation model, in > which all roles are siblings of one another. > Organizations often have a need for hierarchical resource allocation > constraints, whether for fair sharing of resources or for specifying quota > constraints. > Consider the following fair sharing hierarchy based on "shares": > {noformat}
>        ^                          ^
>       / \                        / \
>      /   \                      /   \
>  eng (3)  sales (1)  =>  eng (75%)  sales (25%)
>    ^                        ^
>   / \                      / \
>  /   \                    /   \
> ads (2)  build (1)    ads (66%)  build (33%)
> {noformat} > The hierarchy specifies that the engineering organization should get 3x as > many resources as sales, and within these resources the ads team should get > 2x as many resources as the build team. The implication of this is that, if > the ads team is not using some of its resources, the build team and > engineering organization will be able to use these resources before the sales > organization can. Without a hierarchy, the resources unused by the ads team > would be re-distributed among all other roles (rather than only its siblings). > Quota can also apply in a hierarchical manner: > {noformat}
>            ^
>           / \
>          /   \
> eng (90 cpus)  sales (10 cpus)
>          ^
>         / \
>        /   \
> ads (50 cpus)  build (10 cpus)
> {noformat} > See https://people.eecs.berkeley.edu/~alig/papers/h-drf.pdf for some > discussion w.r.t. sharing resources in a hierarchical model. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
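The share arithmetic in the quoted example (eng=3, sales=1 becoming 75%/25% of the parent) is just per-level normalization. A tiny illustrative helper, not allocator code; {{normalizeShares}} is an assumed name:

```cpp
#include <cassert>
#include <map>
#include <string>

// Normalize sibling "shares" at one level of the role hierarchy into
// fractions of the parent's resources. With eng=3 and sales=1 this yields
// 0.75 and 0.25, matching the fair-sharing example in the issue.
std::map<std::string, double> normalizeShares(
    const std::map<std::string, double>& shares) {
  double total = 0.0;
  for (const auto& entry : shares) {
    total += entry.second;
  }

  std::map<std::string, double> fractions;
  for (const auto& entry : shares) {
    fractions[entry.first] = entry.second / total;
  }
  return fractions;
}
```

Applying the same normalization one level down (ads=2, build=1) gives the 66%/33% split of eng's portion, which is what makes unused ads resources flow to build before they flow to sales.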
[jira] [Assigned] (MESOS-6896) Support backend per container.
[ https://issues.apache.org/jira/browse/MESOS-6896?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gilbert Song reassigned MESOS-6896: --- Assignee: Gilbert Song > Support backend per container. > -- > > Key: MESOS-6896 > URL: https://issues.apache.org/jira/browse/MESOS-6896 > Project: Mesos > Issue Type: Improvement > Components: containerization >Reporter: Gilbert Song >Assignee: Gilbert Song > Labels: backend, containerizer > > Currently, the container backend is determined by the agent flag and all > containers are using the same backend. It is possible to achieve backend per > container by introducing a user facing API, which fulfills more robust use > cases (e.g., imagine that a group of container/nested container running an > application, while some containers only read from huge images and some others > only write to pluggable volumes). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-6904) Perform batching of allocations to reduce allocator queue backlogging.
[ https://issues.apache.org/jira/browse/MESOS-6904?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15817026#comment-15817026 ] Yan Xu edited comment on MESOS-6904 at 1/24/17 9:55 PM: Reviews currently in progress: https://reviews.apache.org/r/51027/ https://reviews.apache.org/r/51028/ https://reviews.apache.org/r/52534/ https://reviews.apache.org/r/55852/ https://reviews.apache.org/r/55893/ https://reviews.apache.org/r/55874/ was (Author: jjanco): Reviews currently in progress: https://reviews.apache.org/r/51027/ https://reviews.apache.org/r/51028/ https://reviews.apache.org/r/52534/ WIP from [~gyliu] https://reviews.apache.org/r/51621/ > Perform batching of allocations to reduce allocator queue backlogging. > -- > > Key: MESOS-6904 > URL: https://issues.apache.org/jira/browse/MESOS-6904 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Jacob Janco >Assignee: Jacob Janco >Priority: Critical > Labels: allocator > > Per MESOS-3157: > {quote} > Our deployment environments have a lot of churn, with many short-live > frameworks that often revive offers. Running the allocator takes a long time > (from seconds up to minutes). > In this situation, event-triggered allocation causes the event queue in the > allocator process to get very long, and the allocator effectively becomes > unresponsive (eg. a revive offers message takes too long to come to the head > of the queue). > {quote} > To remedy the above scenario, it is proposed to perform batching of the > enqueued allocation operations so that a single allocation operation can > satisfy N enqueued allocations. This should reduce the potential for > backlogging in the allocator. See the discussion > [here|https://issues.apache.org/jira/browse/MESOS-3157?focusedCommentId=14728377=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14728377] > in MESOS-3157. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
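The batching idea in this ticket (one allocation pass satisfying N enqueued triggers) can be sketched abstractly. This is an illustrative sketch under assumed names ({{BatchingAllocator}}, {{trigger}}, {{runPass}}), not the actual libprocess-based implementation in the reviews above:

```cpp
#include <cassert>
#include <cstddef>

// Sketch of allocation batching: if a pass is already pending, further
// triggers are coalesced into it instead of enqueuing additional passes,
// so the allocator's queue cannot back up with redundant allocation events.
class BatchingAllocator {
 public:
  void trigger() {
    if (pending_) {
      return;  // Coalesce: the already-enqueued pass will cover this trigger.
    }
    pending_ = true;  // In the real system this would enqueue one pass.
  }

  // Invoked when the single enqueued pass actually runs.
  void runPass() {
    pending_ = false;
    ++passesRun_;
  }

  bool pending() const { return pending_; }
  std::size_t passesRun() const { return passesRun_; }

 private:
  bool pending_ = false;
  std::size_t passesRun_ = 0;
};
```

Under this scheme a burst of revive/decline messages costs at most one queued allocation pass, which is the backlog reduction the ticket describes.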
[jira] [Created] (MESOS-6987) Incorrect metrics when framework on unreachable agent is torndown
Neil Conway created MESOS-6987: -- Summary: Incorrect metrics when framework on unreachable agent is torndown Key: MESOS-6987 URL: https://issues.apache.org/jira/browse/MESOS-6987 Project: Mesos Issue Type: Bug Components: master Reporter: Neil Conway Assignee: Neil Conway Priority: Minor Attachments: disconnect_framework_metrics_wrong-1.patch See attached patch. Scenario: * task T for framework F is launched on agent X * agent X is marked unreachable * framework F is torn down * agent X re-registers * task T is shut down The task is listed as "killed" in the {{/tasks}} endpoint, but "unreachable" in the master's metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6987) Incorrect metrics when framework on unreachable agent is torndown
[ https://issues.apache.org/jira/browse/MESOS-6987?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Neil Conway updated MESOS-6987: --- Attachment: disconnect_framework_metrics_wrong-1.patch > Incorrect metrics when framework on unreachable agent is torndown > - > > Key: MESOS-6987 > URL: https://issues.apache.org/jira/browse/MESOS-6987 > Project: Mesos > Issue Type: Bug > Components: master >Reporter: Neil Conway >Assignee: Neil Conway >Priority: Minor > Labels: mesosphere, metrics > Attachments: disconnect_framework_metrics_wrong-1.patch > > > See attached patch. Scenario: > * task T for framework F is launched on agent X > * agent X is marked unreachable > * framework F is torn-down > * agent X re-registers > * task T is shutdown > The task is listed as "killed" in the {{/tasks}} endpoint, but "unreachable" > in the master's metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-6986) abort in DRFSorter::add
Yvan Royon created MESOS-6986: - Summary: abort in DRFSorter::add Key: MESOS-6986 URL: https://issues.apache.org/jira/browse/MESOS-6986 Project: Mesos Issue Type: Bug Components: allocation Affects Versions: 1.0.1 Environment: Mesosphere Enterprise DC/OS, CoreOS Reporter: Yvan Royon My mesos-master process terminated on SIGABRT. The CHECK failed in function {{DRFSorter::add}}: https://github.com/apache/mesos/blob/master/src/master/allocator/sorter/drf/sorter.cpp#L74 It seems there is a condition during framework registration where names are lost? We are using the mesos-go library ({{next}} branch), which uses the new HTTP API. The framework is custom Go code. The crash is hard to reliably reproduce. {code}
mesos-master[90061]: F0119 01:07:57.426159 90086 sorter.cpp:73] Check failed: !contains(name)
mesos-master[90061]: *** Check failure stack trace: ***
mesos-master[90061]: @ 0x7f960d9299fd google::LogMessage::Fail()
mesos-master[90061]: @ 0x7f960d92b82d google::LogMessage::SendToLog()
mesos-master[90061]: @ 0x7f960d9295ec google::LogMessage::Flush()
mesos-master[90061]: @ 0x7f960d92c129 google::LogMessageFatal::~LogMessageFatal()
mesos-master[90061]: @ 0x7f960d03460d mesos::internal::master::allocator::DRFSorter::add()
mesos-master[90061]: @ 0x7f960d021177 mesos::internal::master::allocator::internal::HierarchicalAllocatorProcess::addFramework()
mesos-master[90061]: @ 0x7f960d8b9381 process::ProcessManager::resume()
mesos-master[90061]: @ 0x7f960d8b9687 _ZNSt6thread5_ImplISt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvEUt_vEEE6_M_runEv
mesos-master[90061]: @ 0x7f960bf52d73 (unknown)
mesos-master[90061]: @ 0x7f960b74f52c (unknown)
mesos-master[90061]: @ 0x7f960b49180d (unknown)
systemd[1]: dcos-mesos-master.service: Main process exited, code=killed, status=6/ABRT
{code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
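The failing {{Check failed: !contains(name)}} guards a simple invariant: a sorter client name must not be added twice. The real code aborts the process via glog's CHECK; the sketch below shows the same invariant in a non-aborting form so the duplicate-add situation can be observed safely. {{NameRegistry}} is an illustrative name, not Mesos code:

```cpp
#include <cassert>
#include <set>
#include <string>

// Sketch of the invariant behind the CHECK in DRFSorter::add: each client
// name may be registered only once. DRFSorter aborts on violation via
// CHECK(!contains(name)); this illustrative variant reports the duplicate
// by returning false instead.
class NameRegistry {
 public:
  // Returns true if `name` was newly added, false if it was already
  // present -- the condition the MESOS-6986 stack trace shows being hit
  // from HierarchicalAllocatorProcess::addFramework.
  bool add(const std::string& name) {
    return names_.insert(name).second;
  }

 private:
  std::set<std::string> names_;
};
```

Seen this way, the reported crash implies addFramework was invoked twice with the same framework name, e.g. via a duplicate registration, rather than a name being "lost".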
[jira] [Commented] (MESOS-6985) os::getenv() can segfault
[ https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15836657#comment-15836657 ] Benjamin Bannier commented on MESOS-6985: - Are we sure this is caused by {{os::getenv}} itself? In test code we sometimes call e.g., {{os::setenv}} and then read the values back later. We avoid this in non-test code since {{::getenv}} is not required to be reentrant, and we would ideally not perform environment mutations in test code either once multiple actors are running. > os::getenv() can segfault > - > > Key: MESOS-6985 > URL: https://issues.apache.org/jira/browse/MESOS-6985 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without > libevent/SSL >Reporter: Greg Mann > Labels: stout > Attachments: > MasterMaintenanceTest.InverseOffersFilters-truncated.txt, > MasterTest.MultipleExecutors.txt > > > This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 > and has been produced by the tests {{MasterTest.MultipleExecutors}} and > {{MasterMaintenanceTest.InverseOffersFilters}}. 
In both cases, > {{os::getenv()}} segfaults with the same stack trace: > {code} > *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are > using GNU date *** > PC: @ 0x2ad59e3ae82d (unknown) > I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0 > *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; > stack trace: *** > I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: > executor(75)@172.17.0.2:45752 with pid 28591 > @ 0x2ad5ab953197 (unknown) > @ 0x2ad5ab957479 (unknown) > @ 0x2ad59e165330 (unknown) > @ 0x2ad59e3ae82d (unknown) > @ 0x2ad594631358 os::getenv() > @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment() > @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor() > @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run() > @ 0x2ad59ac1ec10 > _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_ > @ 0x2ad59ac1e6bf > _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x2ad59bce2304 std::function<>::operator()() > @ 0x2ad59bcc9824 process::ProcessBase::visit() > @ 0x2ad59bd4028e process::DispatchEvent::visit() > @ 0x2ad594616df1 process::ProcessBase::serve() > @ 0x2ad59bcc72b7 process::ProcessManager::resume() > @ 0x2ad59bcd567c > process::ProcessManager::init_threads()::$_2::operator()() > @ 0x2ad59bcd5585 > _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE > @ 0x2ad59bcd std::_Bind_simple<>::operator()() > @ 0x2ad59bcd552c 
std::thread::_Impl<>::_M_run() > @ 0x2ad59d9e6a60 (unknown) > @ 0x2ad59e15d184 start_thread > @ 0x2ad59e46d37d (unknown) > make[4]: *** [check-local] Segmentation fault > {code} > Find attached the full log from a failed run of > {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of > {{MasterMaintenanceTest.InverseOffersFilters}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
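Benjamin's comment points at concurrent environment mutation: {{::getenv}} and {{::setenv}} are not guaranteed safe to call from multiple threads at once, which is one way {{os::getenv()}} could fault in tests that mutate the environment while libprocess actors run. One conceivable mitigation, sketched here with hypothetical names (this is not the stout API), is to serialize all environment access behind a single mutex:

```cpp
#include <cstdlib>
#include <mutex>
#include <string>

// Illustrative sketch only: serialize environment reads behind one mutex.
// This helps only if *all* environment access (reads and writes) goes
// through the same lock; a writer calling ::setenv directly still races.
static std::mutex envMutex;

std::string getenvSafe(const std::string& name, const std::string& fallback) {
  std::lock_guard<std::mutex> lock(envMutex);
  const char* value = ::getenv(name.c_str());
  // Copy out while holding the lock: the pointer returned by ::getenv may
  // be invalidated by a subsequent environment modification.
  return value == nullptr ? fallback : std::string(value);
}
```

This does not prove the reported crash is an environment race, but it matches the shape of the stack trace: the fault is inside {{os::getenv()}} while other actors are active.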
[jira] [Updated] (MESOS-6985) os::getenv() can segfault
[ https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Greg Mann updated MESOS-6985: - Attachment: MasterMaintenanceTest.InverseOffersFilters-truncated.txt > os::getenv() can segfault > - > > Key: MESOS-6985 > URL: https://issues.apache.org/jira/browse/MESOS-6985 > Project: Mesos > Issue Type: Bug > Components: stout > Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without > libevent/SSL >Reporter: Greg Mann > Labels: stout > Attachments: > MasterMaintenanceTest.InverseOffersFilters-truncated.txt, > MasterTest.MultipleExecutors.txt > > > This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 > and has been produced by the tests {{MasterTest.MultipleExecutors}} and > {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases, > {{os::getenv()}} segfaults with the same stack trace: > {code} > *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are > using GNU date *** > PC: @ 0x2ad59e3ae82d (unknown) > I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0 > *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; > stack trace: *** > I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: > executor(75)@172.17.0.2:45752 with pid 28591 > @ 0x2ad5ab953197 (unknown) > @ 0x2ad5ab957479 (unknown) > @ 0x2ad59e165330 (unknown) > @ 0x2ad59e3ae82d (unknown) > @ 0x2ad594631358 os::getenv() > @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment() > @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor() > @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run() > @ 0x2ad59ac1ec10 > _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_ > @ 0x2ad59ac1e6bf > 
_ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_ > @ 0x2ad59bce2304 std::function<>::operator()() > @ 0x2ad59bcc9824 process::ProcessBase::visit() > @ 0x2ad59bd4028e process::DispatchEvent::visit() > @ 0x2ad594616df1 process::ProcessBase::serve() > @ 0x2ad59bcc72b7 process::ProcessManager::resume() > @ 0x2ad59bcd567c > process::ProcessManager::init_threads()::$_2::operator()() > @ 0x2ad59bcd5585 > _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE > @ 0x2ad59bcd std::_Bind_simple<>::operator()() > @ 0x2ad59bcd552c std::thread::_Impl<>::_M_run() > @ 0x2ad59d9e6a60 (unknown) > @ 0x2ad59e15d184 start_thread > @ 0x2ad59e46d37d (unknown) > make[4]: *** [check-local] Segmentation fault > {code} > Find attached the full log from a failed run of > {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of > {{MasterMaintenanceTest.InverseOffersFilters}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-6985) os::getenv() can segfault
[ https://issues.apache.org/jira/browse/MESOS-6985?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Greg Mann updated MESOS-6985:
-----------------------------
    Attachment: MasterTest.MultipleExecutors.txt

> os::getenv() can segfault
> -------------------------
>
>                 Key: MESOS-6985
>                 URL: https://issues.apache.org/jira/browse/MESOS-6985
>             Project: Mesos
>          Issue Type: Bug
>          Components: stout
>         Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without libevent/SSL
>            Reporter: Greg Mann
>              Labels: stout
>         Attachments: MasterTest.MultipleExecutors.txt
>
> This was observed on ASF CI. The segfault first showed up on CI on 9/20/16
> and has been produced by the tests {{MasterTest.MultipleExecutors}} and
> {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases,
> {{os::getenv()}} segfaults with the same stack trace:
> {code}
> *** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are using GNU date ***
> PC: @ 0x2ad59e3ae82d (unknown)
> I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0
> *** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; stack trace: ***
> I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: executor(75)@172.17.0.2:45752 with pid 28591
> @ 0x2ad5ab953197 (unknown)
> @ 0x2ad5ab957479 (unknown)
> @ 0x2ad59e165330 (unknown)
> @ 0x2ad59e3ae82d (unknown)
> @ 0x2ad594631358 os::getenv()
> @ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment()
> @ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor()
> @ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run()
> @ 0x2ad59ac1ec10 _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
> @ 0x2ad59ac1e6bf _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
> @ 0x2ad59bce2304 std::function<>::operator()()
> @ 0x2ad59bcc9824 process::ProcessBase::visit()
> @ 0x2ad59bd4028e process::DispatchEvent::visit()
> @ 0x2ad594616df1 process::ProcessBase::serve()
> @ 0x2ad59bcc72b7 process::ProcessManager::resume()
> @ 0x2ad59bcd567c process::ProcessManager::init_threads()::$_2::operator()()
> @ 0x2ad59bcd5585 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
> @ 0x2ad59bcd std::_Bind_simple<>::operator()()
> @ 0x2ad59bcd552c std::thread::_Impl<>::_M_run()
> @ 0x2ad59d9e6a60 (unknown)
> @ 0x2ad59e15d184 start_thread
> @ 0x2ad59e46d37d (unknown)
> make[4]: *** [check-local] Segmentation fault
> {code}
> Find attached the full log from a failed run of
> {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of
> {{MasterMaintenanceTest.InverseOffersFilters}}.
[jira] [Created] (MESOS-6985) os::getenv() can segfault
Greg Mann created MESOS-6985:
--------------------------------

             Summary: os::getenv() can segfault
                 Key: MESOS-6985
                 URL: https://issues.apache.org/jira/browse/MESOS-6985
             Project: Mesos
          Issue Type: Bug
          Components: stout
         Environment: ASF CI, Ubuntu 14.04 and CentOS 7 both with and without libevent/SSL
            Reporter: Greg Mann

This was observed on ASF CI. The segfault first showed up on CI on 9/20/16 and has been produced by the tests {{MasterTest.MultipleExecutors}} and {{MasterMaintenanceTest.InverseOffersFilters}}. In both cases, {{os::getenv()}} segfaults with the same stack trace:

{code}
*** Aborted at 1485241617 (unix time) try "date -d @1485241617" if you are using GNU date ***
PC: @ 0x2ad59e3ae82d (unknown)
I0124 07:06:57.422080 28619 exec.cpp:162] Version: 1.2.0
*** SIGSEGV (@0xf0) received by PID 28591 (TID 0x2ad5a7b87700) from PID 240; stack trace: ***
I0124 07:06:57.422336 28615 exec.cpp:212] Executor started at: executor(75)@172.17.0.2:45752 with pid 28591
@ 0x2ad5ab953197 (unknown)
@ 0x2ad5ab957479 (unknown)
@ 0x2ad59e165330 (unknown)
@ 0x2ad59e3ae82d (unknown)
@ 0x2ad594631358 os::getenv()
@ 0x2ad59aba6acf mesos::internal::slave::executorEnvironment()
@ 0x2ad59ab845c0 mesos::internal::slave::Framework::launchExecutor()
@ 0x2ad59ab818a2 mesos::internal::slave::Slave::_run()
@ 0x2ad59ac1ec10 _ZZN7process8dispatchIN5mesos8internal5slave5SlaveERKNS_6FutureIbEERKNS1_13FrameworkInfoERKNS1_12ExecutorInfoERK6OptionINS1_8TaskInfoEERKSF_INS1_13TaskGroupInfoEES6_S9_SC_SH_SL_EEvRKNS_3PIDIT_EEMSP_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_ENKUlPNS_11ProcessBaseEE_clES16_
@ 0x2ad59ac1e6bf _ZNSt17_Function_handlerIFvPN7process11ProcessBaseEEZNS0_8dispatchIN5mesos8internal5slave5SlaveERKNS0_6FutureIbEERKNS5_13FrameworkInfoERKNS5_12ExecutorInfoERK6OptionINS5_8TaskInfoEERKSJ_INS5_13TaskGroupInfoEESA_SD_SG_SL_SP_EEvRKNS0_3PIDIT_EEMST_FvT0_T1_T2_T3_T4_ET5_T6_T7_T8_T9_EUlS2_E_E9_M_invokeERKSt9_Any_dataS2_
@ 0x2ad59bce2304 std::function<>::operator()()
@ 0x2ad59bcc9824 process::ProcessBase::visit()
@ 0x2ad59bd4028e process::DispatchEvent::visit()
@ 0x2ad594616df1 process::ProcessBase::serve()
@ 0x2ad59bcc72b7 process::ProcessManager::resume()
@ 0x2ad59bcd567c process::ProcessManager::init_threads()::$_2::operator()()
@ 0x2ad59bcd5585 _ZNSt12_Bind_simpleIFZN7process14ProcessManager12init_threadsEvE3$_2vEE9_M_invokeIJEEEvSt12_Index_tupleIJXspT_EEE
@ 0x2ad59bcd std::_Bind_simple<>::operator()()
@ 0x2ad59bcd552c std::thread::_Impl<>::_M_run()
@ 0x2ad59d9e6a60 (unknown)
@ 0x2ad59e15d184 start_thread
@ 0x2ad59e46d37d (unknown)
make[4]: *** [check-local] Segmentation fault
{code}

Find attached the full log from a failed run of {{MasterTest.MultipleExecutors}} and a truncated log from a failed run of {{MasterMaintenanceTest.InverseOffersFilters}}.
[jira] [Created] (MESOS-6984) Pull out the docker image build step out of `support/docker-build.sh`.
Michael Park created MESOS-6984:
-----------------------------------

             Summary: Pull out the docker image build step out of `support/docker-build.sh`.
                 Key: MESOS-6984
                 URL: https://issues.apache.org/jira/browse/MESOS-6984
             Project: Mesos
          Issue Type: Task
            Reporter: Michael Park

The {{support/docker-build.sh}} script currently writes a {{Dockerfile}}, performs a docker build, runs the image, then deletes the image. The docker build step is quite expensive and often flaky. We should simply pull a docker image from Dockerhub so that we can make our CI more stable and efficient.
[jira] [Commented] (MESOS-6320) Implement clang-tidy check to catch incorrect flags hierarchies
[ https://issues.apache.org/jira/browse/MESOS-6320?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836569#comment-15836569 ]

Michael Park commented on MESOS-6320:
-------------------------------------

{noformat}
commit d76f8d298b9f302c92ce4d0ff7ebed9e116a95a6
Author: Benjamin Bannier
Date:   Wed Dec 21 19:33:30 2016 +0100

    [clang-tidy] Added Mesos check of custom Flags classes.

    This change fixes MESOS-6320.
{noformat}

> Implement clang-tidy check to catch incorrect flags hierarchies
> ---------------------------------------------------------------
>
>                 Key: MESOS-6320
>                 URL: https://issues.apache.org/jira/browse/MESOS-6320
>             Project: Mesos
>          Issue Type: Bug
>            Reporter: Benjamin Bannier
>            Assignee: Benjamin Bannier
>              Labels: clang-tidy, mesosphere
>             Fix For: 1.2.0
>
> Classes derived from {{FlagsBase}} must always use {{virtual}} inheritance. Likewise, composing such derived flags requires inheriting from them virtually again.
> Some examples:
> {code}
> struct A : virtual FlagsBase {}; // OK
> struct B : FlagsBase {}; // ERROR
> struct C : A {}; // ERROR
> {code}
> We should implement a clang-tidy check to catch such incorrect inheritance.
[jira] [Commented] (MESOS-5393) XFS disk isolator should disallow sandbox writes when no 'disk' is used in executor/task
[ https://issues.apache.org/jira/browse/MESOS-5393?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15836565#comment-15836565 ]

James Peach commented on MESOS-5393:
------------------------------------

Implemented as a 1-block quota. Note that this makes it impossible to run a task because the quota gets used by agent logs.

> XFS disk isolator should disallow sandbox writes when no 'disk' is used in executor/task
> ----------------------------------------------------------------------------------------
>
>                 Key: MESOS-5393
>                 URL: https://issues.apache.org/jira/browse/MESOS-5393
>             Project: Mesos
>          Issue Type: Bug
>          Components: containerization
>    Affects Versions: 1.0.0
>            Reporter: Yan Xu
>            Assignee: James Peach
>
> This is similar to MESOS-5081 and was left as a TODO in the first patch for the XFS isolator.
> {noformat:title=}
> // TODO(jpeach) If there's no disk resource attached, we should set the
> // minimum quota (1 block), since a zero quota would be unconstrained.
> {noformat}