[jira] [Assigned] (MESOS-9669) Deprecate v0 quota calls.
[ https://issues.apache.org/jira/browse/MESOS-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9669: -- Assignee: Benjamin Mahler > Deprecate v0 quota calls. > - > > Key: MESOS-9669 > URL: https://issues.apache.org/jira/browse/MESOS-9669 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, resource-management > > Once we introduce the new quota APIs in MESOS-8068, we should deprecate the > `/quota` endpoint. We should mark this as deprecated and hide it in our > documentation. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9937) 53598228fe should be backported to 1.7.x
[ https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9937: -- Assignee: Greg Mann Priority: Blocker (was: Major) Target Version/s: 1.7.3 Marking as a blocker for the next 1.7.x release. Greg please reassign if someone else can pick this up. > 53598228fe should be backported to 1.7.x > > > Key: MESOS-9937 > URL: https://issues.apache.org/jira/browse/MESOS-9937 > Project: Mesos > Issue Type: Bug >Reporter: longfei >Assignee: Greg Mann >Priority: Blocker > > Commit 53598228fe on the master branch should be backported to 1.7.x. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905316#comment-16905316 ] Benjamin Mahler commented on MESOS-9852: {quote} Do you mean max_*_tasks_per_framework? Would this history take hundreds of MBs? I'll try... {quote} Yes, for task history: {noformat} --max_completed_frameworks --max_completed_tasks_per_framework {noformat} {quote} I found that every terminated(no matter completed or unreachable) task would be put into slaves.unreachableTasks and would only be erased in _doRegistryGc. {quote} This will only happen for unreachable agents. Please file a ticket if you see otherwise. cc [~greggomann] [~vinodkone] At this point I don't see the leak described in this ticket in the memory profiling data, so we can continue the discussion on the mailing list or in slack, to avoid spamming the watchers of this ticket. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, > _tmp_libprocess.Do1MrG_profile (1).svg, _tmp_libprocess.Do1MrG_profile > 24hours.dump, _tmp_libprocess.Do1MrG_profile 24hours.svg, screenshot-1.png, > statistics > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. 
due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
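The fix pattern described in this ticket (keep a handle to the offer filter's timer so both the filter and the timer can be reclaimed when the filter is removed early, e.g. on revive) can be sketched in simplified form. This is an illustrative model only, not the actual libprocess-based allocator code; the type and member names are hypothetical.

```cpp
#include <map>
#include <memory>
#include <string>

// Illustrative stand-ins for the allocator's OfferFilter and
// process::Timer; names are hypothetical.
struct OfferFilter { std::string frameworkId; };
struct Timer { int id = 0; };

class FilterTracker {
public:
  void addFilter(const std::string& frameworkId) {
    // The fix: store the Timer handle next to the heap-allocated
    // filter, so both can be reclaimed before the timeout fires.
    Entry entry;
    entry.filter = std::make_unique<OfferFilter>(OfferFilter{frameworkId});
    entry.timer = Timer{nextTimerId_++};
    filters_[frameworkId] = std::move(entry);
  }

  // On REVIVE, cancel the timer and free the filter immediately,
  // instead of leaking both until the (possibly distant) expiration.
  void revive(const std::string& frameworkId) {
    filters_.erase(frameworkId);
  }

  size_t filterCount() const { return filters_.size(); }

private:
  struct Entry {
    std::unique_ptr<OfferFilter> filter;
    Timer timer;
  };

  std::map<std::string, Entry> filters_;
  int nextTimerId_ = 0;
};
```

Without the stored handle, the erase on revive could only drop the map entry, while the heap allocation and timer state would linger until the timer fired.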
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904850#comment-16904850 ] Benjamin Mahler commented on MESOS-9852: [~carlone] not sure if you intended to reply to my message but I noticed you attached the additional 24 hour data. Looking at it, it appears to be mostly due to task history. If you don't care about the task history, you can tune the master's flags to reduce the amount of framework / task history stored. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, > _tmp_libprocess.Do1MrG_profile (1).svg, _tmp_libprocess.Do1MrG_profile > 24hours.dump, _tmp_libprocess.Do1MrG_profile 24hours.svg, screenshot-1.png, > statistics > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9932) Removal of a role from the suppression list should be equivalent to REVIVE.
[ https://issues.apache.org/jira/browse/MESOS-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9932: -- Assignee: Benjamin Mahler > Removal of a role from the suppression list should be equivalent to REVIVE. > --- > > Key: MESOS-9932 > URL: https://issues.apache.org/jira/browse/MESOS-9932 > Project: Mesos > Issue Type: Improvement > Components: allocation, scheduler api >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > [~timcharper] and [~daa] pointed out that removal of a role from the > suppression list (e.g. via UPDATE_FRAMEWORK) does not clear filters. This > means that schedulers have to issue a separate explicit REVIVE for the roles > they want to remove. > It seems like these are not the semantics we want, and we should instead be > clearing filters upon removing a role from the suppression list. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9932) Removal of a role from the suppression list should be equivalent to REVIVE.
Benjamin Mahler created MESOS-9932: -- Summary: Removal of a role from the suppression list should be equivalent to REVIVE. Key: MESOS-9932 URL: https://issues.apache.org/jira/browse/MESOS-9932 Project: Mesos Issue Type: Improvement Components: allocation, scheduler api Reporter: Benjamin Mahler [~timcharper] and [~daa] pointed out that removal of a role from the suppression list (e.g. via UPDATE_FRAMEWORK) does not clear filters. This means that schedulers have to issue a separate explicit REVIVE for the roles they want to remove. It seems like these are not the semantics we want, and we should instead be clearing filters upon removing a role from the suppression list. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904005#comment-16904005 ] Benjamin Mahler commented on MESOS-9852: {quote} The newest commit is 8e8c6c0. {quote} [~carlone] this is what the /version endpoint shows? I don't see anything abnormal looking, just a combination of increasing number of connections, task history, and offer filters. But the sample you took is only looking at 35 MB of memory growth. Can you run this over the course of a very long time period to try to capture a large amount of the memory increase? E.g. 12 hours - 72 hours? Be sure to show the same graph as before so we know what the memory consumption history looked like. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, > _tmp_libprocess.Do1MrG_profile (1).svg, screenshot-1.png, statistics > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. 
due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901248#comment-16901248 ] Benjamin Mahler commented on MESOS-9852: [~carlone] we can figure out whether it has this fix if we have the commit sha. You can check this by hitting the /version endpoint on the master. In any case, please include the memory profiling data as well. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: screenshot-1.png > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899214#comment-16899214 ] Benjamin Mahler commented on MESOS-9852: Hi [~carlone], 1.7.3 is not released yet, are you referring to the 1.7.x release branch with the fix in this ticket applied? Please report your findings using the built in memory profiling: http://mesos.apache.org/documentation/latest/memory-profiling/ > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: screenshot-1.png > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-8069) Role-related endpoints need to reflect hierarchical accounting.
[ https://issues.apache.org/jira/browse/MESOS-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897302#comment-16897302 ] Benjamin Mahler commented on MESOS-8069: This was done for the v0 /roles endpoint but still needs to be done for v1 GET_ROLES. > Role-related endpoints need to reflect hierarchical accounting. > --- > > Key: MESOS-8069 > URL: https://issues.apache.org/jira/browse/MESOS-8069 > Project: Mesos > Issue Type: Bug > Components: agent, HTTP API, master >Reporter: Benjamin Mahler >Assignee: Till Toenshoff >Priority: Major > Labels: mesosphere, multitenancy, resource-management > Attachments: Screen Shot 2018-03-06 at 15.06.04.png > > > With the introduction of hierarchical roles, the role-related endpoints need > to be updated to provide aggregated accounting information. > For example, information about how many resources are allocated to "/eng" > should include the resources allocated to "/eng/frontend" and "/eng/backend", > since quota guarantees and limits are also applied on the aggregation. > This also affects the UI display, for example the 'Roles' tab. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9427) Revisit quota documentation.
[ https://issues.apache.org/jira/browse/MESOS-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9427: -- Assignee: Benjamin Mahler > Revisit quota documentation. > > > Key: MESOS-9427 > URL: https://issues.apache.org/jira/browse/MESOS-9427 > Project: Mesos > Issue Type: Documentation > Components: allocation, documentation >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: multitenancy, resource-management > > At this point the quota documentation in the docs/ folder has become rather > stale. It would be good to at least update any inaccuracies and ideally > re-write it to better reflect the current thinking. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9758) Take ports out of the roles endpoints.
[ https://issues.apache.org/jira/browse/MESOS-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897300#comment-16897300 ] Benjamin Mahler commented on MESOS-9758: v0 /roles no longer has ports, but v1 GET_ROLES still has it. > Take ports out of the roles endpoints. > -- > > Key: MESOS-9758 > URL: https://issues.apache.org/jira/browse/MESOS-9758 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Priority: Major > Labels: resource-management > > It does not make sense to combine ports across agents. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-6200) Hope mesos support soft and hard cpu/memory resource in the task
[ https://issues.apache.org/jira/browse/MESOS-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896580#comment-16896580 ] Benjamin Mahler commented on MESOS-6200: [~xds2000] I think this request is about minimum / maximum container cpu / memory and I don't think that rlimits is the way to accomplish that. We will be working on it via MESOS-9916.
> Hope mesos support soft and hard cpu/memory resource in the task
> ----------------------------------------------------------------
>
> Key: MESOS-6200
> URL: https://issues.apache.org/jira/browse/MESOS-6200
> Project: Mesos
> Issue Type: Improvement
> Components: containerization, docker, scheduler api
> Affects Versions: 0.28.2
> Environment: CentOS 7
> Kernel 3.10.0-327.28.3.el7.x86_64
> Mesos 0.28.2
> Docker 1.11.2
> Reporter: Lei Xu
> Priority: Major
> Labels: resource-management
>
> The Docker executor could support soft/hard resource limits to enable more flexible resource sharing among applications.
> || || CPU || Memory ||
> | hard limit | --cpu-period & --cpu-quota | --memory & --memory-swap |
> | soft limit | --cpu-shares | --memory-reservation |
> The task protobuf message currently has only one resource struct, used to describe the cgroup limit, and the Docker executor handles it as follows; only --memory and --cpu-shares are set:
> {code}
> if (resources.isSome()) {
>   // TODO(yifan): Support other resources (e.g. disk).
>   Option<double> cpus = resources.get().cpus();
>   if (cpus.isSome()) {
>     uint64_t cpuShare =
>       std::max((uint64_t) (CPU_SHARES_PER_CPU * cpus.get()), MIN_CPU_SHARES);
>     argv.push_back("--cpu-shares");
>     argv.push_back(stringify(cpuShare));
>   }
>
>   Option<Bytes> mem = resources.get().mem();
>   if (mem.isSome()) {
>     Bytes memLimit = std::max(mem.get(), MIN_MEMORY);
>     argv.push_back("--memory");
>     argv.push_back(stringify(memLimit.bytes()));
>   }
> }
> {code}
> I hope the executor and the protobuf message could separate resources into two parts, soft and hard, so that users could set two levels of resource limits for Docker containers.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
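One way the requested soft/hard split could look when assembling docker run arguments. This is a hypothetical sketch, not the actual Mesos executor code; the constants and the cpuArgs helper are illustrative, with --cpu-shares as the soft limit and a CFS period/quota pair as the hard limit.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative constants mirroring common cgroup conventions.
constexpr uint64_t CPU_SHARES_PER_CPU = 1024;
constexpr uint64_t MIN_CPU_SHARES = 2;
constexpr uint64_t CPU_CFS_PERIOD_US = 100000;

std::vector<std::string> cpuArgs(double softCpus, double hardCpus) {
  std::vector<std::string> argv;

  // Soft limit: relative weight under contention (--cpu-shares).
  uint64_t shares = std::max(
      static_cast<uint64_t>(CPU_SHARES_PER_CPU * softCpus), MIN_CPU_SHARES);
  argv.push_back("--cpu-shares");
  argv.push_back(std::to_string(shares));

  // Hard limit: CFS quota capping absolute usage
  // (--cpu-period / --cpu-quota).
  argv.push_back("--cpu-period");
  argv.push_back(std::to_string(CPU_CFS_PERIOD_US));
  argv.push_back("--cpu-quota");
  argv.push_back(std::to_string(
      static_cast<uint64_t>(CPU_CFS_PERIOD_US * hardCpus)));

  return argv;
}
```

For example, cpuArgs(1.0, 2.0) yields a 1024-share soft weight and a quota allowing up to two CPUs of hard usage.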
[jira] [Created] (MESOS-9916) Support per-container cpu / memory bursting.
Benjamin Mahler created MESOS-9916: -- Summary: Support per-container cpu / memory bursting. Key: MESOS-9916 URL: https://issues.apache.org/jira/browse/MESOS-9916 Project: Mesos Issue Type: Epic Components: containerization, scheduler api Reporter: Benjamin Mahler Currently, the cgroup cpu policy is burned in at the agent level. The user can start the agent with {{--cgroups_enable_cfs}} to apply cfs quota to all containers (effectively disallowing exceeding the requested amount of cpus for all containers on the agent). The agent does not allow containers to exceed the requested memory (except when a container's requested memory is shrunk). We should instead enable per-container cpu / memory bursting via per-container cpu and memory requests / limits. See kubernetes for an example of a per container cpu/memory bursting API: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container -- This message was sent by Atlassian JIRA (v7.6.14#76016)
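The requests/limits model referenced above maps naturally onto cgroup knobs. A minimal sketch under the assumption of a kubernetes-style API; the struct and field names are hypothetical, and cgroup v1 memory controls are shown for concreteness.

```cpp
#include <cstdint>

// Hypothetical request/limit pair for one container: the request is
// the guaranteed amount, the limit is the burst ceiling.
struct MemorySpec {
  uint64_t requestBytes;
  uint64_t limitBytes;
};

// Possible mapping onto cgroup v1 memory controls: the soft limit
// reflects the guarantee, the hard limit caps bursting.
struct CgroupMemory {
  uint64_t softLimitInBytes;  // memory.soft_limit_in_bytes
  uint64_t limitInBytes;      // memory.limit_in_bytes
};

CgroupMemory toCgroupMemory(const MemorySpec& spec) {
  return CgroupMemory{spec.requestBytes, spec.limitBytes};
}
```

A container with request < limit can then burst above its guarantee when the agent has spare memory, instead of being capped at the requested amount.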
[jira] [Created] (MESOS-9915) Store a role tree in the master.
Benjamin Mahler created MESOS-9915: -- Summary: Store a role tree in the master. Key: MESOS-9915 URL: https://issues.apache.org/jira/browse/MESOS-9915 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Currently, both the master and allocator track known roles in maps (note however that the master does not currently have complete tracking of known roles). These Role structs track some information about roles, but currently do not track information hierarchically. As a result, when per-role resource quantities were exposed in the API, we had to add code outside of the master's Role struct to perform the hierarchical aggregation. It would be nice if the master (and allocator) had a complete Role tree stored and updated in an event driven manner to obtain information cheaply at any point in time. Ideally this role tree abstraction can be shared (e.g. with the allocator) which may not be trivial since the information tracked might differ. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9861) Make PushGauges support floating point stats.
[ https://issues.apache.org/jira/browse/MESOS-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9861: -- Assignee: Benjamin Mahler
> Make PushGauges support floating point stats.
> ---------------------------------------------
>
> Key: MESOS-9861
> URL: https://issues.apache.org/jira/browse/MESOS-9861
> Project: Mesos
> Issue Type: Bug
> Components: metrics
> Reporter: Meng Zhu
> Assignee: Benjamin Mahler
> Priority: Major
> Labels: foundations, resource-management
>
> Currently, PushGauges are modeled after counters, so they do not support floating point stats. This prevents many existing PullGauges from using them. We need to add support for floating point stats.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9912) Webui roles table sorting treats 0 entries as largest values.
Benjamin Mahler created MESOS-9912: -- Summary: Webui roles table sorting treats 0 entries as largest values. Key: MESOS-9912 URL: https://issues.apache.org/jira/browse/MESOS-9912 Project: Mesos Issue Type: Bug Components: webui Reporter: Benjamin Mahler Currently, the webui roles table displays dashes ("-") for zero entries to ease readability of non-zero entries; however, this alters the column sorting behavior to treat these entries as larger than any number. The expected behavior is for "-" entries to be treated as zero. Ideally we can fix this without sticking zeroes everywhere and reducing the readability of the table. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9603) Add quota limits metrics.
[ https://issues.apache.org/jira/browse/MESOS-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9603: -- Assignee: Benjamin Mahler > Add quota limits metrics. > - > > Key: MESOS-9603 > URL: https://issues.apache.org/jira/browse/MESOS-9603 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9901) Specialize jsonify for protobuf Maps.
[ https://issues.apache.org/jira/browse/MESOS-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891171#comment-16891171 ] Benjamin Mahler commented on MESOS-9901: [~bbannier] hm.. not sure how the existing format was produced but it doesn't comply with the standard mapping? https://developers.google.com/protocol-buffers/docs/proto3#json I think we should just bite the bullet and send out an email to make the breaking change to get towards the proto3 standard json mapping.
> Specialize jsonify for protobuf Maps.
> -------------------------------------
>
> Key: MESOS-9901
> URL: https://issues.apache.org/jira/browse/MESOS-9901
> Project: Mesos
> Issue Type: Improvement
> Components: json api
> Reporter: Meng Zhu
> Priority: Major
>
> jsonify currently treats protobuf maps as regular repeated fields. For example, for the schema
> {noformat}
> message QuotaConfig {
>   required string role = 1;
>   map<string, Value.Scalar> guarantees = 2;
>   map<string, Value.Scalar> limits = 3;
> }
> {noformat}
> it will produce:
> {noformat}
> "configs": [
>   {
>     "role": "role1",
>     "guarantees": [
>       {
>         "key": "cpus",
>         "value": {
>           "value": 1
>         }
>       },
>       {
>         "key": "mem",
>         "value": {
>           "value": 512
>         }
>       }
>     ]
> {noformat}
> This output cannot be parsed back into the proto messages. We need to specialize jsonify for map types.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
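For contrast, the proto3 standard JSON mapping renders a map field as a JSON object keyed by the map keys, rather than as an array of key/value entries. A small sketch of that target shape; the hand-rolled serialization below is for illustration only, not the proposed jsonify specialization.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Renders map entries in the proto3 JSON object form, e.g.
// {"cpus":{"value":1},"mem":{"value":512}}, instead of the
// repeated [{"key":...,"value":...}] form.
std::string toMapJson(
    const std::vector<std::pair<std::string, double>>& entries) {
  std::ostringstream out;
  out << "{";
  bool first = true;
  for (const auto& [key, value] : entries) {
    if (!first) out << ",";
    first = false;
    out << "\"" << key << "\":{\"value\":" << value << "}";
  }
  out << "}";
  return out.str();
}
```

The object form round-trips: standard proto3 JSON parsers can read it back into the map field directly.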
[jira] [Created] (MESOS-9897) Remove java and python language bindings from the source tree.
Benjamin Mahler created MESOS-9897: -- Summary: Remove java and python language bindings from the source tree. Key: MESOS-9897 URL: https://issues.apache.org/jira/browse/MESOS-9897 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler The java and python bindings are not well maintained, and now that we have the HTTP-based v1 scheduler and executor APIs it would be good to remove the burden of carrying them. I've targeted this for the 2.0 milestone so that we remember to do it, since this is a breaking change. If there are no objections from users, we could find a way to remove them prior to 2.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9896) Consider using protobuf provided json conversion facilities rather than custom ones.
Benjamin Mahler created MESOS-9896: -- Summary: Consider using protobuf provided json conversion facilities rather than custom ones. Key: MESOS-9896 URL: https://issues.apache.org/jira/browse/MESOS-9896 Project: Mesos Issue Type: Task Components: stout Reporter: Benjamin Mahler Currently, stout provides custom JSON to protobuf conversion facilities, some of which use protobuf reflection. When upgrading protobuf to 3.7.x in MESOS-9755, we found that the v0 /state response of the master slowed down, and it appears to be due to a performance regression in the protobuf reflection code. We should file an issue with protobuf, but we should also look into using the json conversion code that protobuf provides to see if that can help avoid the regression. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
[ https://issues.apache.org/jira/browse/MESOS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885568#comment-16885568 ] Benjamin Mahler edited comment on MESOS-9890 at 7/15/19 9:13 PM: - https://reviews.apache.org/r/71073/ https://reviews.apache.org/r/71077/ was (Author: bmahler): https://reviews.apache.org/r/71073/ (no test yet) > /roles and GET_ROLES does not always expose parent roles. > - > > Key: MESOS-9890 > URL: https://issues.apache.org/jira/browse/MESOS-9890 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If some descendant roles are present in frameworks, then the parent roles > will not be exposed in the /roles and GET_ROLES endpoints. > This is because the tracking is currently based on frameworks being > subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
[ https://issues.apache.org/jira/browse/MESOS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885568#comment-16885568 ] Benjamin Mahler commented on MESOS-9890: https://reviews.apache.org/r/71073/ (no test yet) > /roles and GET_ROLES does not always expose parent roles. > - > > Key: MESOS-9890 > URL: https://issues.apache.org/jira/browse/MESOS-9890 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If some descendant roles are present in frameworks, then the parent roles > will not be exposed in the /roles and GET_ROLES endpoints. > This is because the tracking is currently based on frameworks being > subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9888) /roles and GET_ROLES do not expose roles with only static reservations
[ https://issues.apache.org/jira/browse/MESOS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9888: -- Assignee: Benjamin Mahler > /roles and GET_ROLES do not expose roles with only static reservations > -- > > Key: MESOS-9888 > URL: https://issues.apache.org/jira/browse/MESOS-9888 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If a role is only known to the master because of an agent with static > reservations to that role, it will not be shown in the /roles and GET_ROLES > APIs. > This is because the roles are tracked based on frameworks primarily. We'll > need to update the tracking to include when there are agents with > reservations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
[ https://issues.apache.org/jira/browse/MESOS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9890: -- Assignee: Benjamin Mahler > /roles and GET_ROLES does not always expose parent roles. > - > > Key: MESOS-9890 > URL: https://issues.apache.org/jira/browse/MESOS-9890 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If some descendant roles are present in frameworks, then the parent roles > will not be exposed in the /roles and GET_ROLES endpoints. > This is because the tracking is currently based on frameworks being > subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
Benjamin Mahler created MESOS-9890: -- Summary: /roles and GET_ROLES does not always expose parent roles. Key: MESOS-9890 URL: https://issues.apache.org/jira/browse/MESOS-9890 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler If some descendant roles are present in frameworks, then the parent roles will not be exposed in the /roles and GET_ROLES endpoints. This is because the tracking is currently based on frameworks being subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884048#comment-16884048 ] Benjamin Mahler commented on MESOS-5037: [~haosd...@gmail.com] Can you file a separate ticket for the performance problem? And we can keep this ticket as a foreachkey issue? > foreachkey behaviour is not expected in multimap > > > Key: MESOS-5037 > URL: https://issues.apache.org/jira/browse/MESOS-5037 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: haosdent >Priority: Major > Labels: foundations, stout > > Currently the {{foreachkey}} implementation is > {code} > #define foreachkey(VAR, COL)\ > foreachpair (VAR, __foreach__::ignore, COL) > {code} > This works in most structures. But in multimap, one key may map to multi > values. This means there are multi pairs which have same key. So when call > {{foreachkey}}, the {{key}} would duplicated when iteration. My idea to solve > this is we prefer call {{foreach}} on {{(COL).keys()}} if {{keys()}} method > exists in {{COL}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884044#comment-16884044 ] Benjamin Mahler edited comment on MESOS-5037 at 7/12/19 5:59 PM:
[~bmahler] Sure, it is https://github.com/apache/mesos/blob/9932550e9632e7fbb9a45b217793c7f508f57001/src/master/master.cpp#L7707-L7708
{code}
void Master::__reregisterSlave(
...
  foreachkey (FrameworkID frameworkId, slaves.unreachableTasks.at(slaveInfo.id())) {
  ...
    foreach (TaskID taskId, slaves.unreachableTasks.at(slaveInfo.id()).get(frameworkId)) {
{code}
Our case: when the network flaps and 3~4 agents reregister, the master saturates its CPU and cannot process any requests during that period.

was (Author: haosd...@gmail.com):
[~bmahler] Sure, it is https://github.com/apache/mesos/blob/master/src/master/master.cpp#L7707-L7708
{code}
void Master::__reregisterSlave(
...
  foreachkey (FrameworkID frameworkId, slaves.unreachableTasks.at(slaveInfo.id())) {
  ...
    foreach (TaskID taskId, slaves.unreachableTasks.at(slaveInfo.id()).get(frameworkId)) {
{code}
Our case: when the network flaps and 3~4 agents reregister, the master saturates its CPU and cannot process any requests during that period.

> foreachkey behaviour is not expected in multimap
> ------------------------------------------------
>
> Key: MESOS-5037
> URL: https://issues.apache.org/jira/browse/MESOS-5037
> Project: Mesos
> Issue Type: Bug
> Components: stout
> Reporter: haosdent
> Priority: Major
> Labels: foundations, stout
>
> Currently the {{foreachkey}} implementation is
> {code}
> #define foreachkey(VAR, COL) \
>   foreachpair (VAR, __foreach__::ignore, COL)
> {code}
> This works for most structures, but in a multimap one key may map to multiple values, so there are multiple pairs with the same key. As a result, {{foreachkey}} visits duplicate keys during iteration. My idea to solve this: prefer calling {{foreach}} on {{(COL).keys()}} if a {{keys()}} method exists in {{COL}}.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884011#comment-16884011 ] Benjamin Mahler commented on MESOS-5037: [~haosd...@gmail.com] can you post a link to the code in question? > foreachkey behaviour is not expected in multimap > > > Key: MESOS-5037 > URL: https://issues.apache.org/jira/browse/MESOS-5037 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: haosdent >Priority: Major > Labels: foundations, stout > > Currently the {{foreachkey}} implementation is > {code} > #define foreachkey(VAR, COL)\ > foreachpair (VAR, __foreach__::ignore, COL) > {code} > This works for most structures. But in a multimap, one key may map to multiple > values, so there are multiple pairs with the same key. When calling > {{foreachkey}}, the key is therefore visited once per value during iteration. My suggested fix is > to prefer calling {{foreach}} on {{(COL).keys()}} if a {{keys()}} method > exists in {{COL}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883340#comment-16883340 ] Benjamin Mahler commented on MESOS-5037: [~haosd...@gmail.com] foreachkey indeed sounds problematic for multimap. I didn't follow the CPU load issue you found. Can you file a related ticket explaining it? Be sure to show the code in question that is inducing the CPU load, and attach perf data if possible. > foreachkey behaviour is not expected in multimap > > > Key: MESOS-5037 > URL: https://issues.apache.org/jira/browse/MESOS-5037 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: haosdent >Priority: Major > Labels: foundations, stout > > Currently the {{foreachkey}} implementation is > {code} > #define foreachkey(VAR, COL)\ > foreachpair (VAR, __foreach__::ignore, COL) > {code} > This works for most structures. But in a multimap, one key may map to multiple > values, so there are multiple pairs with the same key. When calling > {{foreachkey}}, the key is therefore visited once per value during iteration. My suggested fix is > to prefer calling {{foreach}} on {{(COL).keys()}} if a {{keys()}} method > exists in {{COL}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
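The duplicate-key behaviour described in this ticket is easy to reproduce outside of stout. Below is a small self-contained C++ sketch (the helper names are ours, not stout's) contrasting pair-based iteration — which, like the foreachpair-based {{foreachkey}} expansion, visits a key once per value — with iteration over distinct keys, which is the shape of the proposed {{keys()}}-based fix:

```cpp
#include <map>
#include <string>
#include <vector>

// Mimics what the foreachkey/foreachpair expansion does on a multimap:
// iterating pairs and taking only the key visits a key once per value.
std::vector<std::string> keysViaPairs(const std::multimap<std::string, int>& m)
{
  std::vector<std::string> keys;
  for (const auto& pair : m) {
    keys.push_back(pair.first); // A key appears once per value mapped to it.
  }
  return keys;
}

// The shape of the proposed fix: iterate distinct keys only (what a
// keys() method on the container would return).
std::vector<std::string> distinctKeys(const std::multimap<std::string, int>& m)
{
  std::vector<std::string> keys;
  for (auto it = m.begin(); it != m.end(); it = m.upper_bound(it->first)) {
    keys.push_back(it->first);
  }
  return keys;
}
```

With {"a" -> 1, "a" -> 2, "b" -> 3}, the first helper yields three keys while the second yields the two distinct keys.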
[jira] [Commented] (MESOS-8789) Role-related endpoints should display distinct offered and allocated resources.
[ https://issues.apache.org/jira/browse/MESOS-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883290#comment-16883290 ] Benjamin Mahler commented on MESOS-8789: {noformat} commit d6738bcc86525e1ac661d2027a1934134426255f Author: Benjamin Mahler Date: Wed Jul 10 19:36:54 2019 -0400 Added Role::reserved, Role::allocated, Role::offered to master. This provides a breakdown of resource quantities on a per-role basis, that would aid debugging if shown in the endpoints and roles table in the ui. Review: https://reviews.apache.org/r/71050 {noformat} {noformat} commit 69c8feab6a62b1728872a367a8ed28f88eb029d3 (HEAD -> master, apache/master) Author: Benjamin Mahler Date: Wed Jul 10 20:09:31 2019 -0400 Added reserved, offered, allocated resources to the /roles endpoint. This provides helpful information for debugging, as well as for the webui to display in the roles table. Review: https://reviews.apache.org/r/71053 {noformat} > Role-related endpoints should display distinct offered and allocated > resources. > --- > > Key: MESOS-8789 > URL: https://issues.apache.org/jira/browse/MESOS-8789 > Project: Mesos > Issue Type: Improvement > Components: agent, HTTP API, master >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, multitenancy, resource-management > > The role endpoints currently show accumulated values for resources > (allocated), containing offered resources. For gaining an overview showing > our allocated resources separately from the offered resources could improve > the signal quality, depending on the use case. > This also affects the UI display, for example the "Roles" tab. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9888) /roles and GET_ROLES do not expose roles with only static reservations
Benjamin Mahler created MESOS-9888: -- Summary: /roles and GET_ROLES do not expose roles with only static reservations Key: MESOS-9888 URL: https://issues.apache.org/jira/browse/MESOS-9888 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler If a role is only known to the master because an agent has static reservations for that role, it will not be shown in the /roles and GET_ROLES APIs. This is because roles are tracked primarily based on frameworks. We'll need to update the tracking to also include roles for which agents hold reservations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
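The fix described above amounts to a set union. A minimal sketch (the helper name and inputs are illustrative, not the master's actual bookkeeping): the roles the endpoints should expose are the union of roles referenced by frameworks and roles holding reservations on some agent:

```cpp
#include <set>
#include <string>

// Hypothetical helper: the roles the master should expose are the union
// of framework roles and roles with (e.g. static) reservations on agents.
std::set<std::string> knownRoles(
    const std::set<std::string>& frameworkRoles,
    const std::set<std::string>& reservationRoles)
{
  std::set<std::string> roles = frameworkRoles;
  roles.insert(reservationRoles.begin(), reservationRoles.end());
  return roles;
}
```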
[jira] [Assigned] (MESOS-8503) Improve UI when displaying frameworks with many roles.
[ https://issues.apache.org/jira/browse/MESOS-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8503: -- Assignee: (was: Armand Grillet) > Improve UI when displaying frameworks with many roles. > -- > > Key: MESOS-8503 > URL: https://issues.apache.org/jira/browse/MESOS-8503 > Project: Mesos > Issue Type: Task >Reporter: Armand Grillet >Priority: Major > Attachments: Screen Shot 2018-01-29 à 10.38.05.png > > > The /frameworks UI endpoint displays all the roles of each framework in a > table: > !Screen Shot 2018-01-29 à 10.38.05.png! > This is not readable if a framework has many roles. We thus need to provide a > solution to only display a few roles per framework and show more when a user > wants to see all of them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9618) Display quota consumption in the webui.
[ https://issues.apache.org/jira/browse/MESOS-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9618: -- Assignee: Benjamin Mahler > Display quota consumption in the webui. > --- > > Key: MESOS-9618 > URL: https://issues.apache.org/jira/browse/MESOS-9618 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > Currently, the Roles table in the webui displays allocation and quota > guarantees / limits. However, quota "consumption" is different from > allocation, in that reserved resources are always considered consumed against > the quota. > This discrepancy has led to confusion from users. One example occurred when > an agent was added with a large reservation exceeding the memory quota > guarantee. The user sees memory chopping in offers, and since the scheduler > didn't want to use the reservation, it can't launch its tasks. > If consumption is shown in the UI, we should include a tooltip that > indicates how consumption is calculated so that users know how to interpret it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
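To make the allocation-vs-consumption distinction above concrete, here is a minimal sketch using our own names and a single scalar resource (not Mesos's internal representation), assuming reserved-but-unallocated resources are charged against quota as the ticket describes:

```cpp
// Illustrative quantities for one role and one scalar resource (e.g.
// memory in MB); the names are ours, not Mesos's internal representation.
struct RoleResources
{
  double allocated;            // Currently allocated to the role.
  double unallocatedReserved;  // Reserved for the role but sitting idle.
};

// Allocation alone undercounts: reservations are considered consumed
// against the quota even when nothing is running on them. (Allocated
// reservations are already included in `allocated`, so only the
// unallocated portion is added to avoid double counting.)
double quotaConsumed(const RoleResources& r)
{
  return r.allocated + r.unallocatedReserved;
}
```

In the scenario the ticket describes, a role with little allocation but a large reservation would show low allocation yet high consumption, explaining why offers are chopped against the quota guarantee.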
[jira] [Created] (MESOS-9886) RoleTest.RolesEndpointContainsConsumedQuota is flaky.
Benjamin Mahler created MESOS-9886: -- Summary: RoleTest.RolesEndpointContainsConsumedQuota is flaky. Key: MESOS-9886 URL: https://issues.apache.org/jira/browse/MESOS-9886 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Benjamin Mahler {noformat} [ RUN ] RoleTest.RolesEndpointContainsConsumedQuota I0710 07:05:42.670790 9995 cluster.cpp:176] Creating default 'local' authorizer I0710 07:05:42.672238 master.cpp:440] Master 8db40cec-43ef-41a1-89a4-4f7b877d8f13 (ip-172-16-10-69.ec2.internal) started on 172.16.10.69:37082 I0710 07:05:42.672256 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregiste r_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate _frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwr ite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/1d 0m6o/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http _authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initializ e="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_co mpleted_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework ="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework _metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout=" 1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry _store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submission s="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/1d0m6o/master" --zk_session_time out="10secs" I0710 07:05:42.672351 master.cpp:492] Master only allowing authenticated frameworks to register I0710 07:05:42.672356 master.cpp:498] Master only allowing authenticated agents to register I0710 07:05:42.672360 master.cpp:504] Master only allowing authenticated HTTP frameworks to register I0710 07:05:42.672364 credentials.hpp:37] Loading credentials for authentication from '/tmp/1d0m6o/credentials' I0710 07:05:42.672430 master.cpp:548] Using default 'crammd5' authenticator I0710 07:05:42.672466 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0710 07:05:42.672508 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite ' I0710 07:05:42.672538 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler ' I0710 07:05:42.672569 master.cpp:629] Authorization enabled I0710 07:05:42.672658 10001 hierarchical.cpp:241] Initialized hierarchical allocator process I0710 07:05:42.672685 10001 whitelist_watcher.cpp:77] No whitelist given I0710 07:05:42.673316 10001 master.cpp:2150] Elected as the leading master! 
I0710 07:05:42.673331 10001 master.cpp:1664] Recovering from registrar I0710 07:05:42.673616 10001 registrar.cpp:339] Recovering registrar I0710 07:05:42.673874 10001 registrar.cpp:383] Successfully fetched the registry (0B) in 239104ns I0710 07:05:42.673923 10001 registrar.cpp:487] Applied 1 operations in 7745ns; attempting to update the registry I0710 07:05:42.674052 registrar.cpp:544] Successfully updated the registry in 108032ns I0710 07:05:42.674082 registrar.cpp:416] Successfully recovered registrar I0710 07:05:42.674152 master.cpp:1799] Recovered 0 agents from the registry (180B); allowing 10mins for agents to reregister I0710 07:05:42.674185 9996 hierarchical.cpp:280] Skipping recovery of hierarchical allocator: nothing to recover W0710 07:05:42.676100 9995 process.cpp:2877] Attempted to spawn already running process files@172.16.10.69:37082 I0710 07:05:42.676537 9995 containerizer.cpp:314] Using isolation { environment_secret, posix/cpu, posix/mem, filesyst em/posix, network/cni } I0710 07:05:42.678514 9995 linux_launcher.cpp:144] Using /cgroup/freezer as the freezer hierarchy for the Linux launch er I0710 07:05:42.678980 9995 provisioner.cpp:298] Using default backend 'copy' I0710 07:05:42.680043 9995 cluster.cpp:510] Creating default 'local' authorizer I0710 07:05:42.680832 9998 slave.cpp:265] Mesos agent started on (522)@172.16.10.69:37082 I0710 07:05:42.680850 9998 slave.cpp:266] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://; --a
[jira] [Commented] (MESOS-9755) Upgrade bundled protobuf to 3.7.x.
[ https://issues.apache.org/jira/browse/MESOS-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878882#comment-16878882 ] Benjamin Mahler commented on MESOS-9755: For posterity, it looks like there is a performance regression in the v0 API when upgrading to protobuf 3.7.1: Master: {noformat} [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 Test setup: 1000 agents with a total of 1 running tasks and 1 completed tasks v0 '/state' response took 177.001464ms [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 (4593 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 Test setup: 1 agents with a total of 10 running tasks and 10 completed tasks v0 '/state' response took 1.802505171secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 (51571 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 Test setup: 2 agents with a total of 20 running tasks and 20 completed tasks v0 '/state' response took 3.164482263secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 (104737 ms) {noformat} After upgrading to 3.7.1: {noformat} [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 Test setup: 1000 agents with a total of 1 running tasks and 1 completed tasks v0 '/state' response took 253.753947ms [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 (6107 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 Test setup: 1 agents with a total of 10 running tasks and 10 completed tasks v0 '/state' response took 2.118297secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 (58902 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 Test setup: 2 agents with a total of 20 running 
tasks and 20 completed tasks v0 '/state' response took 4.150050151secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 (116661 ms) {noformat} It appears to be due to a performance regression in the reflection code in protobuf. We may want to investigate further with the protobuf maintainers and/or investigate using the built in json conversion support rather than our reflection based implementation. > Upgrade bundled protobuf to 3.7.x. > -- > > Key: MESOS-9755 > URL: https://issues.apache.org/jira/browse/MESOS-9755 > Project: Mesos > Issue Type: Wish >Reporter: Kaiwalya Joshi >Priority: Major > Labels: foundations, integration, protobuf > > We're noticing the following warning emitted by the JVM on JDK9+ for Google > Protobuf _v3.5.0_ > {code} > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil > (file:/home/kjoshi/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.5.0/200fb936907fbab5e521d148026f6033d4aa539e/protobuf-java-3.5.0.jar) > to field java.nio.Buffer.address > WARNING: Please consider reporting this to the maintainers of > com.google.protobuf.UnsafeUtil > {code} > This warning is fixed in ProtoBuf versions [_v3.7.0_ and > above|https://github.com/protocolbuffers/protobuf/releases/tag/v3.7.0]. > As the current access warning can turn into an access violation in later > versions of the JDK, we're requesting Mesos to update to a version of > ProtoBuf that incorporates the needed fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9755) Upgrade bundled protobuf to 3.7.x.
[ https://issues.apache.org/jira/browse/MESOS-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878791#comment-16878791 ] Benjamin Mahler commented on MESOS-9755: Note that upgrading protobuf to 3.7.x breaks the grpc build in the mesos autotools build: {noformat} [HOSTCXX] Compiling src/compiler/cpp_plugin.cc [HOSTCXX] Compiling src/compiler/node_plugin.cc [HOSTCXX] Compiling src/compiler/csharp_plugin.cc [HOSTCXX] Compiling src/compiler/php_plugin.cc [HOSTCXX] Compiling src/compiler/objective_c_plugin.cc [HOSTCXX] Compiling src/compiler/python_plugin.cc [HOSTCXX] Compiling src/compiler/ruby_plugin.cc [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_python_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_csharp_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_objective_c_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_ruby_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_node_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_php_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_cpp_plugin [PROTOC] Generating protobuf CC file from src/proto/grpc/health/v1/health.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/echo_messages.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/payloads.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/core/stats.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/messages.proto third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/srcthird_party/protobuf/src: warning: directory does not exist.: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. 
third_party/protobuf/src: warning: directory does not exist. [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/health/v1/health.proto [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/testing/payloads.proto [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/core/stats.proto third_party/protobuf/src: warning: directory does not exist. [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/testing/echo_messages.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/echo.proto third_party/protobuf/src: warning: directory does not exist. [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/duplicate/echo_duplicate.proto third_party/protobuf/src: warning: directory does not exist. [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/stats.proto [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/testing/messages.proto 
third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): third_party/protobuf/src: warning: directory does not exist. [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed:
[jira] [Created] (MESOS-9881) StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky.
Benjamin Mahler created MESOS-9881: -- Summary: StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky. Key: MESOS-9881 URL: https://issues.apache.org/jira/browse/MESOS-9881 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler This failed in CI: {noformat} 1 tests failed. FAILED: CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0 Error Message: ../../../3rdparty/libprocess/include/process/gmock.hpp:667 Mock function called more times than expected - returning default value. Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 00-00 00-00 10-01 00-00 00-00 00-00>) Returns: false Expected: to be never called Actual: called once - over-saturated and active Stack Trace: ../../../3rdparty/libprocess/include/process/gmock.hpp:667 Mock function called more times than expected - returning default value. Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 
20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 00-00 00-00 10-01 00-00 00-00 00-00>) Returns: false Expected: to be never called Actual: called once - over-saturated and active {noformat} Full test output: {noformat} [ RUN ] CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0 I0702 06:51:02.172196 6961 cluster.cpp:176] Creating default 'local' authorizer I0702 06:51:02.183229 17274 master.cpp:440] Master c310f701-ca24-4ea8-a4be-df3aa3637194 (005dc56bde82) started on 172.17.0.3:35735 I0702 06:51:02.184095 17274 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="50ms" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/Pq6bYz/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" --work_dir="/tmp/Pq6bYz/master" --zk_session_timeout="10secs" I0702 06:51:02.185236 17274 master.cpp:492] Master only allowing authenticated frameworks to register I0702 06:51:02.185819 17274 master.cpp:498] Master only allowing authenticated agents to register I0702 06:51:02.186395 17274 master.cpp:504] Master only allowing authenticated HTTP frameworks to register I0702 06:51:02.186951 17274 credentials.hpp:37] Loading credentials for authentication from '/tmp/Pq6bYz/credentials' I0702 06:51:02.187907 17274 master.cpp:548] Using default 'crammd5' authenticator I0702 06:51:02.188771 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0702 06:51:02.189630 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0702 06:51:02.190573 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0702 06:51:02.191690 17274 master.cpp:629] Authorization enabled I0702 06:51:02.195374 17265
[jira] [Created] (MESOS-9880) Update SUPPRESS/REVIVE calls to return error codes / 200 OK.
Benjamin Mahler created MESOS-9880: -- Summary: Update SUPPRESS/REVIVE calls to return error codes / 200 OK. Key: MESOS-9880 URL: https://issues.apache.org/jira/browse/MESOS-9880 Project: Mesos Issue Type: Improvement Components: master, scheduler api Reporter: Benjamin Mahler Currently, the SUPPRESS/REVIVE calls always return '202 Accepted' even if the call is invalid. Instead, to be aligned with UPDATE_FRAMEWORK, these calls should: - Return 200 OK if successful. - Return an appropriate error response if invalid or erroneous. For the v0 driver, this means: - Send back a FrameworkErrorMessage if invalid or erroneous. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9871) Expose quota consumption in /roles endpoint.
Benjamin Mahler created MESOS-9871: -- Summary: Expose quota consumption in /roles endpoint. Key: MESOS-9871 URL: https://issues.apache.org/jira/browse/MESOS-9871 Project: Mesos Issue Type: Task Components: master Reporter: Benjamin Mahler Assignee: Benjamin Mahler As part of exposing quota consumption to users and displaying quota consumption in the ui, we will need to add it to the /roles endpoint (which is currently what the ui uses for the roles table). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9870) Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master.
[ https://issues.apache.org/jira/browse/MESOS-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9870: -- Assignee: Andrei Sekretenko Target Version/s: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 (was: 1.9.0) > Simultaneous adding/removal of a role from framework's roles and its > suppressed roles crashes the master. > - > > Key: MESOS-9870 > URL: https://issues.apache.org/jira/browse/MESOS-9870 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Blocker > Labels: resource-management > > Calling UPDATE_FRAMEWORK with a new role added both to `FrameworkInfo.roles` > and `suppressed_roles` crashes the master. > The first place that doesn't expect this is the code that increments a `suppressed` > allocator metric: > [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/hierarchical.cpp#L507] > [ > https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/metrics.cpp#L255] > Probably there are other similar places. > Adding a new role in a suppressed state via re-subscribing should also > trigger this bug, but that hasn't been verified yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9870) Adding a new role in a suppressed state crashes the master.
[ https://issues.apache.org/jira/browse/MESOS-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874294#comment-16874294 ] Benjamin Mahler commented on MESOS-9870: Marked this as a blocker for the 1.9.0 release. > Adding a new role in a suppressed state crashes the master. > --- > > Key: MESOS-9870 > URL: https://issues.apache.org/jira/browse/MESOS-9870 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Priority: Major > > Calling UPDATE_FRAMEWORK with a new role added both to 'FrameworkInfo.roles` > and `suppressed_roles` crashes the master. > The first place which doesn't expect this is increasing a `suppressed` > allocator metric: > [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/hierarchical.cpp#L507] > [ > https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/metrics.cpp#L255] > Probably there are other similar places. > Adding a new role in a suppressed state via re-subscribing should also > trigger this bug - haven't checked it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory.
[ https://issues.apache.org/jira/browse/MESOS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872880#comment-16872880 ] Benjamin Mahler commented on MESOS-7899: Hi [~tomq42]! I'd like to direct you instead to the user@ mailing list or slack (e.g. #containerizer) to get help with this. > Expose sandboxes using virtual paths and hide the agent work directory. > --- > > Key: MESOS-7899 > URL: https://issues.apache.org/jira/browse/MESOS-7899 > Project: Mesos > Issue Type: Task >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Major > Fix For: 1.5.0 > > > {{Files}} interface already supports a virtual file system. We should figure > out a way to enable this in {{ /files/download}} endpoint to hide agent > sandbox. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9124) Agent reconfiguration can cause master to unsuppress on scheduler's behalf
[ https://issues.apache.org/jira/browse/MESOS-9124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871826#comment-16871826 ] Benjamin Mahler commented on MESOS-9124: Backporting this fix to active release branches. > Agent reconfiguration can cause master to unsuppress on scheduler's behalf > -- > > Key: MESOS-9124 > URL: https://issues.apache.org/jira/browse/MESOS-9124 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Affects Versions: 1.5.3, 1.6.2, 1.7.2 >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Fix For: 1.8.0 > > > When agent reconfiguration was enabled in Mesos, the allocator was also > updated to remove all offer filters associated with an agent when that > agent's attributes change. In addition, whenever filters for an agent are > removed, the framework is unsuppressed for any roles that had filters on the > agent. > While this ensures that schedulers will have an opportunity to use resources > on an agent after reconfiguration, modifying the scheduler's suppression may > put the scheduler in an inconsistent state, where it believes it is > suppressed in a particular role when it is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9856) REVIVE call with specified role(s) clears filters for all roles of a framework.
[ https://issues.apache.org/jira/browse/MESOS-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9856: -- Assignee: Andrei Sekretenko > REVIVE call with specified role(s) clears filters for all roles of a > framework. > --- > > Key: MESOS-9856 > URL: https://issues.apache.org/jira/browse/MESOS-9856 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > Labels: resource-management > > As pointed out by [~asekretenko], the REVIVE implementation in the allocator > incorrectly clears decline filters for all of the framework's roles, rather > than only those that were specified in the REVIVE call: > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1392 > This should only clear filters for the roles specified in the REVIVE call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9852) Slow memory growth due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9852: -- Assignee: Benjamin Mahler > Slow memory growth due to deferred deletion of offer filters and timers. > > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9856) REVIVE call with specified role(s) clears filters for all roles of a framework.
Benjamin Mahler created MESOS-9856: -- Summary: REVIVE call with specified role(s) clears filters for all roles of a framework. Key: MESOS-9856 URL: https://issues.apache.org/jira/browse/MESOS-9856 Project: Mesos Issue Type: Bug Components: allocation Reporter: Benjamin Mahler As pointed out by [~asekretenko], the REVIVE implementation in the allocator incorrectly clears decline filters for all of the framework's roles, rather than only those that were specified in the REVIVE call: https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1392 This should only clear filters for the roles specified in the REVIVE call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8789) Role-related endpoints should display distinct offered and allocated resources.
[ https://issues.apache.org/jira/browse/MESOS-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8789: -- Assignee: Benjamin Mahler (was: Till Toenshoff) > Role-related endpoints should display distinct offered and allocated > resources. > --- > > Key: MESOS-8789 > URL: https://issues.apache.org/jira/browse/MESOS-8789 > Project: Mesos > Issue Type: Improvement > Components: agent, HTTP API, master >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, multitenancy, resource-management > > The role endpoints currently show accumulated values for resources > (allocated), containing offered resources. For gaining an overview showing > our allocated resources separately from the offered resources could improve > the signal quality, depending on the use case. > This also affects the UI display, for example the "Roles" tab. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8790) Deprecate Role::resources in favor of Role::allocated and Role::offered.
[ https://issues.apache.org/jira/browse/MESOS-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8790: -- Assignee: Benjamin Mahler (was: Till Toenshoff) > Deprecate Role::resources in favor of Role::allocated and Role::offered. > > > Key: MESOS-8790 > URL: https://issues.apache.org/jira/browse/MESOS-8790 > Project: Mesos > Issue Type: Improvement > Components: HTTP API, master >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Assignee: Benjamin Mahler >Priority: Minor > Labels: mesosphere, multitenancy, resource-management > > There are upcoming enhancements around role related resource accounting. The > changes will add a more detailed role related resources accounting. > We need to retire the {{resources}} member of the {{Role}} Message in > mesos.proto (V0 + V1). This in turn means that we follow this deprecation on > the role-related endpoints as well, adding {{allocated}} to both "/roles" as > well as "GET_ROLES". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9852) Slow memory growth due to deferred deletion of offer filters and timers.
Benjamin Mahler created MESOS-9852: -- Summary: Slow memory growth due to deferred deletion of offer filters and timers. Key: MESOS-9852 URL: https://issues.apache.org/jira/browse/MESOS-9852 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Benjamin Mahler The allocator does not keep a handle to the offer filter timer, which means it cannot remove the timer overhead (in this case memory) when removing the offer filter earlier (e.g. due to revive): https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 In addition, the offer filter is allocated on the heap but not deleted until the timer fires (which might take forever!): https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9813) Track role consumed quota for all roles in the allocator.
[ https://issues.apache.org/jira/browse/MESOS-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866034#comment-16866034 ] Benjamin Mahler commented on MESOS-9813: Note that the consumption metrics we expose should not include what is offered, which means we can't simply use the allocator's tracking of quota consumption, since it's unable to distinguish between offered and allocated. > Track role consumed quota for all roles in the allocator. > - > > Key: MESOS-9813 > URL: https://issues.apache.org/jira/browse/MESOS-9813 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Priority: Major > Labels: resource-management > > We are already tracking role consumed quota for roles with non-default quota > in the allocator. We should expand that to track all roles' consumptions > which will then be exposed through metrics later. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9849) Add support for per-role REVIVE / SUPPRESS to V0 scheduler driver.
Benjamin Mahler created MESOS-9849: -- Summary: Add support for per-role REVIVE / SUPPRESS to V0 scheduler driver. Key: MESOS-9849 URL: https://issues.apache.org/jira/browse/MESOS-9849 Project: Mesos Issue Type: Task Components: scheduler driver Reporter: Benjamin Mahler Unfortunately, there are still schedulers that are using the v0 bindings and are unable to move to v1 before wanting to use the per-role REVIVE / SUPPRESS calls. We'll need to add per-role REVIVE / SUPPRESS into the v0 scheduler driver. -- This message was sent by Atlassian JIRA (v7.6.3#76005)

[jira] [Commented] (MESOS-9793) Implement UPDATE_FRAMEWORK call in V0 API
[ https://issues.apache.org/jira/browse/MESOS-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864803#comment-16864803 ] Benjamin Mahler commented on MESOS-9793: [~asekretenko] friendly reminder to add the ticket to the reviews. I assume this ticket is also tracking the python bindings? > Implement UPDATE_FRAMEWORK call in V0 API > - > > Key: MESOS-9793 > URL: https://issues.apache.org/jira/browse/MESOS-9793 > Project: Mesos > Issue Type: Task >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Major > Labels: multitenancy, resource-management > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9808) libprocess can deadlock on termination (cleanup() vs use() + terminate())
[ https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9808: -- Assignee: Benjamin Mahler > libprocess can deadlock on termination (cleanup() vs use() + terminate()) > - > > Key: MESOS-9808 > URL: https://issues.apache.org/jira/browse/MESOS-9808 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Assignee: Benjamin Mahler >Priority: Major > Labels: foundations > Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt, > deadlock_stacks_with_fix.txt > > > Using the process::loop() together with the common pattern of using > libprocess (Process wrapper + dispatching) is prone to causing a deadlock on > libprocess termination if the code does not wait for the loop exit before > termination. > *The deadlock itself is not directly caused by the process::loop(), though.* > It occurs in a following setup with two processes (let's name them A and B). > Thread 1 tries to cleanup process A. It locks processes_mutex and hangs here: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079] > waiting for the process A to have no strong references. > Thread 2 begins with creating a ProcessReference in > ProcessManager::deliver(UPID&) called for process: > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799] > and ends up waiting for processes_mutex in ProcessManager::terminate() for > process B: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155] > - > In the observed case, terminate() for process B was triggered by a > destructor of a process-wrapping object owned by a libprocess loop executing > on A. > I'm attaching the stacks captured at the deadlock. 
Stacks of the threads > which lock one another are in [^deadlock_stacks_filtered.txt] Note frame #1 > in Thread 5 (waiting for all references to expire) and frames #48 and #8 in > Thread 19 (creating a reference and waiting for a processes_mutex). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9808) libprocess can deadlock on termination (cleanup() vs use() + terminate())
[ https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855287#comment-16855287 ] Benjamin Mahler commented on MESOS-9808: Thanks for looking into this [~asekretenko]! This can happen when a dispatch has objects that are bound into it whose destructors will do any of the following: * terminate a process * dispatch to a process using a UPID that didn't resolve to a Process upon construction (highly doubt we have any code doing this) * send a message to a local Process (i.e. in the same OS process) (doubt this will be an issue outside of testing since we use dispatch for local components) The issue is that we currently destruct dropped DispatchEvents to TERMINATING Processes while holding the TERMINATING ProcessReference (whoops!), and so we can execute further calls that try to block on the processes_mutex (e.g. terminate()) while the cleanup of the TERMINATING Process is spinning waiting for transient references to go away. I'm not sure how common the terminate case above is, but it's the most worrying. Probably it makes sense to backport the fix to at least 1.8.x, and ideally further back. I wrote a fix, and spent some time trying to test this but gave up after being unable to figure out how to reliably get into a deadlock state without races. The fix is here: https://reviews.apache.org/r/70778/ Can you let me know if it fixes the issue that you saw without your workaround? 
> libprocess can deadlock on termination (cleanup() vs use() + terminate()) > - > > Key: MESOS-9808 > URL: https://issues.apache.org/jira/browse/MESOS-9808 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Priority: Major > Labels: foundations > Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt > > > Using the process::loop() together with the common pattern of using > libprocess (Process wrapper + dispatching) is prone to causing a deadlock on > libprocess termination if the code does not wait for the loop exit before > termination. > *The deadlock itself is not directly caused by the process::loop(), though.* > It occurs in a following setup with two processes (let's name them A and B). > Thread 1 tries to cleanup process A. It locks processes_mutex and hangs here: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079] > waiting for the process A to have no strong references. > Thread 2 begins with creating a ProcessReference in > ProcessManager::deliver(UPID&) called for process: > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799] > and ends up waiting for processes_mutex in ProcessManager::terminate() for > process B: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155] > - > In the observed case, terminate() for process B was triggered by a > destructor of a process-wrapping object owned by a libprocess loop executing > on A. > I'm attaching the stacks captured at the deadlock. Stacks of the threads > which lock one another are in [^deadlock_stacks_filtered.txt] Note frame #1 > in Thread 5 (waiting for all references to expire) and frames #48 and #8 in > Thread 19 (creating a reference and waiting for a processes_mutex). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9801) Use protobuf arenas for v1 API responses.
[ https://issues.apache.org/jira/browse/MESOS-9801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9801: -- Assignee: William Mahler > Use protobuf arenas for v1 API responses. > - > > Key: MESOS-9801 > URL: https://issues.apache.org/jira/browse/MESOS-9801 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Reporter: Benjamin Mahler >Assignee: William Mahler >Priority: Major > Labels: performance > > The v1 API response construction is currently slower than the v0 API response > construction. A primary reason for this is that the v1 API constructs > intermediate C++ protobuf response objects, which are very expensive in terms > of memory allocation/deallocation cost. Also involved is the use of > {{evolve()}} which evolves messages from unversioned protobuf into v1 > protobuf. This also has very high memory allocation / deallocation cost. > Using arenas for all v1 API response construction will provide a significant > improvement. > This ticket currently captures all the aspects of this: > * Updating {{evolve()}} to use arenas across all v1 API responses. > * Updating all response construction functions (e.g. {{getState())}}) to use > arenas. > * Making this change for both the master and agent. > This is blocked by MESOS-9755 since we need to upgrade our bundled protobuf > to have string fields allocated in the arenas. > We may split out tickets for CHANGELOG purposes if only a portion of this > lands in 1.9.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9801) Use protobuf arenas for v1 API responses.
Benjamin Mahler created MESOS-9801: -- Summary: Use protobuf arenas for v1 API responses. Key: MESOS-9801 URL: https://issues.apache.org/jira/browse/MESOS-9801 Project: Mesos Issue Type: Improvement Components: agent, master Reporter: Benjamin Mahler The v1 API response construction is currently slower than the v0 API response construction. A primary reason for this is that the v1 API constructs intermediate C++ protobuf response objects, which are very expensive in terms of memory allocation/deallocation cost. Also involved is the use of {{evolve()}} which evolves messages from unversioned protobuf into v1 protobuf. This also has very high memory allocation / deallocation cost. Using arenas for all v1 API response construction will provide a significant improvement. This ticket currently captures all the aspects of this: * Updating {{evolve()}} to use arenas across all v1 API responses. * Updating all response construction functions (e.g. {{getState())}}) to use arenas. * Making this change for both the master and agent. This is blocked by MESOS-9755 since we need to upgrade our bundled protobuf to have string fields allocated in the arenas. We may split out tickets for CHANGELOG purposes if only a portion of this lands in 1.9.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9787) Log slow SSL (TLS) peer reverse DNS lookup.
[ https://issues.apache.org/jira/browse/MESOS-9787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9787: -- Assignee: Benjamin Mahler > Log slow SSL (TLS) peer reverse DNS lookup. > --- > > Key: MESOS-9787 > URL: https://issues.apache.org/jira/browse/MESOS-9787 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > > Given the severity of MESOS-9339, we should add logging of slow SSL (TLS) peer reverse DNS lookups. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9787) Log slow SSL (TLS) peer reverse DNS lookup.
Benjamin Mahler created MESOS-9787: -- Summary: Log slow SSL (TLS) peer reverse DNS lookup. Key: MESOS-9787 URL: https://issues.apache.org/jira/browse/MESOS-9787 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Given the severity of MESOS-9339, we should add logging of slow SSL (TLS) peer reverse DNS lookups. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart
[ https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839372#comment-16839372 ] Benjamin Mahler commented on MESOS-9749: cc [~kaysoky] > mesos agent logging hangs upon systemd-journald restart > --- > > Key: MESOS-9749 > URL: https://issues.apache.org/jira/browse/MESOS-9749 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.2 > Environment: Running on centos 7.4.1708, systemd 219 (probably > heavily patched by centos) > mesos-agent command: > {code} > /usr/sbin/mesos-slave \ > > --attributes='canary:canary-false;maintenance_group:group-6;network:10g;platform:centos;platform_major_version:7;rack_name:22.05;type:base;version:v2018-q-1' > \ > --cgroups_enable_cfs \ > --cgroups_hierarchy='/sys/fs/cgroup' \ > --cgroups_net_cls_primary_handle='0xC370' \ > --container_logger='org_apache_mesos_LogrotateContainerLogger' \ > --containerizers='mesos' \ > --credential='file:///etc/mesos-chef/slave-credential' \ > > --default_container_info='\{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"},\{"host_path":"var_tmp","container_path":"/var/tmp","mode":"RW"},\{"host_path":".","container_path":"/mnt/mesos/sandbox","mode":"RW"},\{"host_path":"/usr/share/mesos/geoip","container_path":"/mnt/mesos/geoip","mode":"RO"}]}' > \ > --docker_registry='https://filer-docker-registry.prod.crto.in/' \ > --docker_store_dir='/var/opt/mesos/store/docker' \ > --enforce_container_disk_quota \ > > --executor_environment_variables='\{"PATH":"/bin:/usr/bin","CRITEO_DC":"par","CRITEO_ENV":"prod","CRITEO_GEOIP_PATH":"/mnt/mesos/geoip"}' > \ > --executor_registration_timeout='5mins' \ > --fetcher_cache_dir='/var/opt/mesos/cache' \ > --fetcher_cache_size='2GB' \ > --hooks='com_criteo_mesos_CommandHook' \ > --image_providers='docker' \ > --image_provisioner_backend='copy' \ > > 
--isolation='linux/capabilities,cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,filesystem/linux,docker/runtime,network/cni,disk/xfs,com_criteo_mesos_CommandIsolator' > \ > --logging_level='INFO' \ > > --master='zk://mesos:xx...@mesos-master01-par.central.criteo.prod:2181,mesos-master02-par.central.criteo.prod:2181,mesos-master03-par.central.criteo.prod:2181/mesos' > \ > --modules='file:///etc/mesos-chef/slave-modules.json' \ > --port=5051 \ > --recover='reconnect' \ > --resources='file:///etc/mesos-chef/custom_resources.json' \ > --strict \ > --work_dir='/var/opt/mesos' \ > --xfs_kill_containers \ > --xfs_project_range='[5000-50]' > {code} >Reporter: Gregoire Seux >Priority: Minor > Labels: foundations > > When mesos agent is launched through systemd, a restart of systemd-journald > service makes mesos agent logging hang (no more output).. The process itself > seems to work fine (we can query state via http for instance). > A restart of mesos-agent corrects the issue. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9773) Log the peer address during SSL handshake failure.
Benjamin Mahler created MESOS-9773: -- Summary: Log the peer address during SSL handshake failure. Key: MESOS-9773 URL: https://issues.apache.org/jira/browse/MESOS-9773 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Recently, peer address logging was added to *most* socket errors per MESOS. However, in the case where a non-SSL connection arrives when we have SSL-only mandated, the following confusing error is printed: {noformat} "process.cpp Failed to accept socket: Failed accept: connection error: error::lib(0):func(0):reason(0)" {noformat} We should be able to avoid the confusing message here as well as include the peer address, so that it's easier to know where the connection is coming from. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834119#comment-16834119 ] Benjamin Mahler commented on MESOS-9767: The bizarre thread stack is: {noformat} Thread 21 (Thread 0x7fa1e0e4d700 (LWP 85889)): #0 0x7fa1f05f01c2 in hash_combine_impl (k=52, h=) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:264 #1 hash_combine (v=, seed=) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337 #2 hash_range<__gnu_cxx::__normal_iterator > > (last=..., first=52 '4') at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:351 #3 hash_value > (v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:410 #4 operator() (this=, v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:486 #5 boost::hash_combine (seed=seed@entry=@0x7fa1e0e4c770: 0, v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337 #6 0x7fa1f06ad178 in operator() (this=0x7fa1cc02d068, taskId=...) at /mesos/include/mesos/type_utils.hpp:634 #7 _M_hash_code (this=0x7fa1cc02d068, __k=...) at /usr/include/c++/4.9/bits/hashtable_policy.h:1261 #8 std::_Hashtable > > >, std::allocator > > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::count (this=this@entry=0x7fa1cc02d068, __k=...) at /usr/include/c++/4.9/bits/hashtable.h:1336 #9 0x7fa1f0663eb2 in count (__x=..., this=0x7fa1cc02d068) at /usr/include/c++/4.9/bits/unordered_map.h:592 #10 contains (key=..., this=0x7fa1cc02d068) at /mesos/3rdparty/stout/include/stout/hashmap.hpp:88 #11 erase (key=..., this=0x7fa1cc02d050) at /mesos/3rdparty/stout/include/stout/boundedhashmap.hpp:92 #12 mesos::internal::master::Master::__reregisterSlave(process::UPID const&, mesos::internal::ReregisterSlaveMessage&&, process::Future const&) (this=0x561dcf047380, pid=..., reregisterSlaveMessage=, future=...) 
at /mesos/src/master/master.cpp:7369 #13 0x7fa1f14d54e1 in operator() (args#0=0x561dcf048620, this=) at /mesos/3rdparty/libprocess/../stout/include/stout/lambda.hpp:443 #14 process::ProcessBase::consume(process::DispatchEvent&&) (this=, event=) at /mesos/3rdparty/libprocess/src/process.cpp:3577 #15 0x7fa1f14e89b2 in serve ( event=, this=0x561dcf048620) at /mesos/3rdparty/libprocess/include/process/process.hpp:87 #16 process::ProcessManager::resume (this=, process=0x561dcf048620) at /mesos/3rdparty/libprocess/src/process.cpp:3002 #17 0x7fa1f14ee226 in operator() (__closure=0x561dcf119158) at /mesos/3rdparty/libprocess/src/process.cpp:2511 #18 _M_invoke<> (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1700 #19 operator() (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1688 #20 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf119140) at /usr/include/c++/4.9/thread:115 #21 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #22 0x7fa1ee520064 in start_thread (arg=0x7fa1e0e4d700) at pthread_create.c:309 #23 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} [~ggarg] is this trace present whenever it's hanging? > Add self health monitoring in Mesos master > -- > > Key: MESOS-9767 > URL: https://issues.apache.org/jira/browse/MESOS-9767 > Project: Mesos > Issue Type: Task > Components: master >Affects Versions: 1.6.0 >Reporter: Gaurav Garg >Priority: Major > Fix For: 1.7.2 > > > We have seen issue where Mesos master got stuck and was not responding to > HTTP endpoints like "/metrics/snapshot". This results in calls by the > frameworks and metrics collector to the master to hang. Currently we emit > 'master alive' metric using prometheus. If master hangs, this metrics is not > published and we detect the hangs using alerts on top of this metrics. By the > time someone would have got the alert and restarted the master process, > 15-30mins would have passed by. 
This results in SLA violation by Mesos > cluster users. > It will be nice to implement a self health check monitoring to detect if the > Mesos master is hung/stuck. This will help us to quickly crash the master > process so that one of the other member of the quorum can acquire ZK > leadership lock. > We can use the "/master/health" endpoint for health checks. > Health checks can be initiated in > [src/master/main.cpp|[https://github.com/apache/mesos/blob/master/src/master/main.cpp]] > just after the child master process is > [spawned.|[https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543]] > We can leverage the >
[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834089#comment-16834089 ] Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:54 PM: Mesos master stopped responding to HTTP request at around 16:30PM. At around 17:00PM, master was restarted. Logs are attached after the stack trace. Logs of Mesos master around the same time: {noformat} I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update 
TASK_FAILED for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task 
d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669457 85889 master.cpp:8397]
[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834088#comment-16834088 ] Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:50 PM: Stack trace of the Mesos master when the hang was detected. Captured using gdb. {noformat} Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf128758) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at
[jira] [Assigned] (MESOS-9766) /__processes__ endpoint can hang.
[ https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9766: -- Assignee: Benjamin Mahler > /__processes__ endpoint can hang. > - > > Key: MESOS-9766 > URL: https://issues.apache.org/jira/browse/MESOS-9766 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: foundations > > A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs. > Stack traces provided by [~alexr] revealed that all the threads appeared to > be idle waiting for events. After investigating the code, the issue was found > to be possible when a process gets terminated after the > {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the > dispatch and abandoning the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9766) /__processes__ endpoint can hang.
Benjamin Mahler created MESOS-9766: -- Summary: /__processes__ endpoint can hang. Key: MESOS-9766 URL: https://issues.apache.org/jira/browse/MESOS-9766 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs. Stack traces provided by [~alexr] revealed that all the threads appeared to be idle waiting for events. After investigating the code, the issue was found to be possible when a process gets terminated after the {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the dispatch and abandoning the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
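The failure mode described in the ticket — a route handler dispatches to a process that terminates before replying, so the dispatch is dropped and the caller's future is abandoned — can be illustrated with standard C++ futures. Note one deliberate difference: a libprocess future simply stays pending forever when abandoned (hence the hang), whereas `std::future` at least surfaces `broken_promise`. The `Worker`/`dispatch` names below are hypothetical stand-ins, not libprocess APIs.

```cpp
#include <cassert>
#include <future>

// Hypothetical stand-in for an actor that answers a "list your state"
// dispatch by fulfilling a promise.
struct Worker
{
  std::promise<int> pending;  // response promise for an in-flight dispatch
};

// Dispatch a request and return the future; the caller then blocks on it.
std::future<int> dispatch(Worker& worker)
{
  return worker.pending.get_future();
}

// Simulate the race: the worker terminates (is destroyed) after the
// dispatch but before replying. With std::future the caller observes
// broken_promise; a libprocess future would instead stay pending
// forever -- which is the /__processes__ hang.
bool dispatchWasAbandoned()
{
  std::future<int> response;
  {
    Worker worker;
    response = dispatch(worker);
  }  // worker destroyed here without ever setting a value

  try {
    response.get();
    return false;
  } catch (const std::future_error& e) {
    return e.code() == std::make_error_code(std::future_errc::broken_promise);
  }
}
```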
[jira] [Commented] (MESOS-9761) Mesos UI does not properly account for resources set via `--default-role`
[ https://issues.apache.org/jira/browse/MESOS-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831700#comment-16831700 ] Benjamin Mahler commented on MESOS-9761: As [~vinodkone] mentioned, reservations will show up as "consumption" rather than "guarantee" or "limit". Linking in related ticket. > Mesos UI does not properly account for resources set via `--default-role` > - > > Key: MESOS-9761 > URL: https://issues.apache.org/jira/browse/MESOS-9761 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: resource-management, ui > Attachments: default_role_ui.png > > > In our cluster, we have two agents configured with > "--default_role=slave_public" and 64 cpus each, for a total of 128 cpus > allocated to this role. The right side of the screenshot shows one of them. > However, looking at the "Roles" tab in the Mesos UI, neither "Guarantee" nor > "Limit" shows any resources for this role. > See attached screenshot for details. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources
[ https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831259#comment-16831259 ] Benjamin Mahler commented on MESOS-9619: Updated test: https://reviews.apache.org/r/70580/ > Mesos Master Crashes with Launch Group when using Port Resources > > > Key: MESOS-9619 > URL: https://issues.apache.org/jira/browse/MESOS-9619 > Project: Mesos > Issue Type: Bug > Components: allocation >Affects Versions: 1.4.3, 1.7.1 > Environment: > Testing in both Mesos 1.4.3 and Mesos 1.7.1 >Reporter: Nimi Wariboko Jr. >Assignee: Greg Mann >Priority: Critical > Labels: foundations, master, mesosphere > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.0 > > Attachments: mesos-master.log, mesos-master.snippet.log > > > Original Issue: > [https://lists.apache.org/thread.html/979c8799d128ad0c436b53f2788568212f97ccf324933524f1b4d189@%3Cuser.mesos.apache.org%3E] > When the ports resource is removed, Mesos functions normally (I'm able to > launch the task as many times as possible, while it always fails continually). > Attached is a snippet of the mesos master log from OFFER to crash. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9689) Migrate stout hashmap and hashset to Abseil's "swiss tables".
[ https://issues.apache.org/jira/browse/MESOS-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829344#comment-16829344 ] Benjamin Mahler commented on MESOS-9689: See also: https://code.fb.com/developer-tools/f14/ > Migrate stout hashmap and hashset to Abseil's "swiss tables". > - > > Key: MESOS-9689 > URL: https://issues.apache.org/jira/browse/MESOS-9689 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Benjamin Mahler >Priority: Major > Labels: performance > > For improved lookup and insertion performance, as well as lower memory > consumption, we should migrate stout's hashmap / hashset wrappers to use > Abseil's containers. > There are some subtleties to migration, see: > https://abseil.io/docs/cpp/guides/container > See also: https://youtu.be/ncHmEUmJZf4 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8511) Provide a v0/v1 test scheduler to simplify the tests.
[ https://issues.apache.org/jira/browse/MESOS-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8511: -- Assignee: Benjamin Mahler > Provide a v0/v1 test scheduler to simplify the tests. > - > > Key: MESOS-8511 > URL: https://issues.apache.org/jira/browse/MESOS-8511 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: tech-debt > > Currently, there are a lot of tests that just want to launch a task in order > to test some behavior of the system. These tests have to create their own v0 > or v1 scheduler and invoke the necessary calls on it and expect the necessary > calls / messages back. This is rather verbose. > It would be helpful to have some better abstractions here, like a > TestScheduler that can launch tasks and exposes the status updates for them, > along with other interesting information. E.g. > {code} > class TestScheduler > { > // Add the task to the queue of tasks that need to be launched. > // Returns the stream of status updates for this task. > Queue addTask(const TaskInfo& t); > etc > } > {code} > Probably this could be implemented against both v0 and v1, if we want to > parameterize the tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
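The ticket's {code} sketch can be fleshed out into a minimal compilable form. Everything below is illustrative — `TaskInfo`, `TaskStatus`, and `deliver` are simplified stand-ins for the real Mesos types and the mock master/agent side, not the actual test API:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <queue>
#include <string>

// Hypothetical sketch of the TestScheduler abstraction proposed above:
// tests enqueue TaskInfos and read back a per-task stream of status
// updates, instead of wiring up a full v0/v1 scheduler by hand.
struct TaskInfo { std::string id; };
struct TaskStatus { std::string id; std::string state; };

class TestScheduler
{
public:
  // Add the task to the queue of tasks that need to be launched.
  // Returns the stream of status updates for this task.
  std::shared_ptr<std::queue<TaskStatus>> addTask(const TaskInfo& task)
  {
    pending.push(task);
    auto stream = std::make_shared<std::queue<TaskStatus>>();
    updates[task.id] = stream;
    return stream;
  }

  // Called by the (mock) master/agent side when an update arrives.
  void deliver(const TaskStatus& status)
  {
    updates[status.id]->push(status);
  }

private:
  std::queue<TaskInfo> pending;
  std::map<std::string, std::shared_ptr<std::queue<TaskStatus>>> updates;
};
```

The same interface could plausibly be backed by either the v0 driver or the v1 HTTP scheduler library, which is what would enable parameterized tests.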
[jira] [Assigned] (MESOS-9701) Allocator's roles map should track reservations.
[ https://issues.apache.org/jira/browse/MESOS-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9701: -- Assignee: Andrei Sekretenko > Allocator's roles map should track reservations. > > > Key: MESOS-9701 > URL: https://issues.apache.org/jira/browse/MESOS-9701 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > Labels: resource-management > > Currently, the allocator's {{roles}} map only tracks roles that have > allocations or framework subscriptions: > https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L531-L535 > And we separately track a map of total reservations for each role: > https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L541-L547 > Confusingly, the {{roles}} map won't have an entry when there is a > reservation for a role but no allocations or frameworks subscribed. We should > ensure that the map has an entry when there are reservations. Also, we can > consolidate the reservation information and framework ids into the same map, > e.g.: > {code} > struct Role > { > hashset frameworkIds; > ResourceQuantities totalReservations; > }; > hashmap roles; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9734) Allocator pause/resume functionality should compensate for a missed allocation cycle.
Benjamin Mahler created MESOS-9734: -- Summary: Allocator pause/resume functionality should compensate for a missed allocation cycle. Key: MESOS-9734 URL: https://issues.apache.org/jira/browse/MESOS-9734 Project: Mesos Issue Type: Bug Components: allocation Reporter: Benjamin Mahler This matters more when the allocation cycle interval is set to large values (e.g. 30 seconds, 1 minute, etc). When the allocator is paused, the interval timeouts continue but an allocation cycle gets skipped. So, if the interval is long, when it's resumed, it can take up to an entire interval again to have another cycle. E.g. with a 1 minute cycle:
0 mins
1 min: allocate
1.01 mins: pause
2 mins: allocate skipped
2.01 mins: resume
3 mins: allocate
In this case, one would expect that resuming at 2.01 mins should just immediately trigger an allocation cycle since we're "overdue" for one, and start the interval timeouts again fresh. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
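The compensation the ticket asks for can be sketched as a small state machine: remember whether an interval tick fired while paused, and if so run an allocation immediately on resume instead of waiting up to a full interval again. This is an illustrative sketch, not the real `HierarchicalAllocator` interface:

```cpp
#include <cassert>

// Sketch (hypothetical interface) of pause/resume compensation: a tick
// that fires while paused is remembered, and resume() triggers the
// overdue allocation immediately.
class Allocator
{
public:
  void pause() { paused = true; }

  // Called by the interval timer every `--allocation_interval`.
  void onIntervalTick()
  {
    if (paused) {
      missedCycle = true;  // allocation skipped; compensate on resume
    } else {
      allocate();
    }
  }

  void resume()
  {
    paused = false;
    if (missedCycle) {
      missedCycle = false;
      allocate();  // overdue: run now and restart the interval fresh
    }
  }

  int allocations = 0;  // exposed for the example's assertions

private:
  void allocate() { ++allocations; }

  bool paused = false;
  bool missedCycle = false;
};
```

In the 1-minute timeline above, this makes the 2.01-minute resume perform the allocation that was skipped at 2 minutes.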
[jira] [Commented] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
[ https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819434#comment-16819434 ] Benjamin Mahler commented on MESOS-9710: {noformat} commit a03db7d684f343656aa229771f30c4990a2839c1 Author: Benjamin Mahler Date: Tue Apr 9 17:08:02 2019 -0400 Added a test of hierarchical sorting for the random sorter. Review: https://reviews.apache.org/r/70438 {noformat} > Add tests to ensure random sorter performs correct weighted sorting. > > > Key: MESOS-9710 > URL: https://issues.apache.org/jira/browse/MESOS-9710 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > > We added tests for the weighted shuffle algorithm, but didn't test that the > RandomSorter's sort() function behaves correctly. > We should also test that hierarchical weights in the random sorter behave > correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7258: -- Assignee: Andrei Sekretenko (was: Benjamin Mahler) > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > Labels: multitenancy, resource-management > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7258: -- Assignee: Benjamin Mahler > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: multitenancy, resource-management > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7258: -- Assignee: (was: Kapil Arya) > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Priority: Major > Labels: multitenancy, resource-management > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
[ https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9710: -- Assignee: Benjamin Mahler (was: Meng Zhu) Assigning to myself for adding the hierarchical tests. > Add tests to ensure random sorter performs correct weighted sorting. > > > Key: MESOS-9710 > URL: https://issues.apache.org/jira/browse/MESOS-9710 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > > We added tests for the weighted shuffle algorithm, but didn't test that the > RandomSorter's sort() function behaves correctly. > We should also test that hierarchical weights in the random sorter behave > correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
[ https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812758#comment-16812758 ] Benjamin Mahler commented on MESOS-9710: Review for the first half; testing that flat role sorting behaves correctly: https://reviews.apache.org/r/70418/ > Add tests to ensure random sorter performs correct weighted sorting. > > > Key: MESOS-9710 > URL: https://issues.apache.org/jira/browse/MESOS-9710 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Meng Zhu >Priority: Major > > We added tests for the weighted shuffle algorithm, but didn't test that the > RandomSorter's sort() function behaves correctly. > We should also test that hierarchical weights in the random sorter behave > correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
Benjamin Mahler created MESOS-9710: -- Summary: Add tests to ensure random sorter performs correct weighted sorting. Key: MESOS-9710 URL: https://issues.apache.org/jira/browse/MESOS-9710 Project: Mesos Issue Type: Task Components: allocation Reporter: Benjamin Mahler Assignee: Meng Zhu We added tests for the weighted shuffle algorithm, but didn't test that the RandomSorter's sort() function behaves correctly. We should also test that hierarchical weights in the random sorter behave correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
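A test of weighted random sorting is necessarily statistical: run the weighted pick many times and check that each role lands in the first position in proportion to its weight. The sketch below uses `std::discrete_distribution` as a stand-in for the RandomSorter's weighted shuffle; it is not the Mesos test code.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Measure how often `target` is picked first under the given weights.
// With weights {2, 1}, index 0 should win roughly 2/3 of the time.
double firstPickFrequency(
    const std::vector<double>& weights, int target, int trials)
{
  std::mt19937 rng(42);  // fixed seed keeps the check deterministic
  std::discrete_distribution<int> pick(weights.begin(), weights.end());

  int hits = 0;
  for (int i = 0; i < trials; ++i) {
    if (pick(rng) == target) {
      ++hits;
    }
  }
  return static_cast<double>(hits) / trials;
}
```

A real test would assert the frequency lies within a tolerance band wide enough to avoid flakiness, which is essentially what the review linked in the comments above does for flat role sorting.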
[jira] [Created] (MESOS-9701) Allocator's roles map should track reservations.
Benjamin Mahler created MESOS-9701: -- Summary: Allocator's roles map should track reservations. Key: MESOS-9701 URL: https://issues.apache.org/jira/browse/MESOS-9701 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Benjamin Mahler Currently, the allocator's {{roles}} map only tracks roles that have allocations or framework subscriptions: https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L531-L535 And we separately track a map of total reservations for each role: https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L541-L547 Confusingly, the {{roles}} map won't have an entry when there is a reservation for a role but no allocations or frameworks subscribed. We should ensure that the map has an entry when there are reservations. Also, we can consolidate the reservation information and framework ids into the same map, e.g.: {code} struct Role { hashset frameworkIds; ResourceQuantities totalReservations; }; hashmap roles; {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
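The consolidation the ticket proposes can be made concrete with a compilable sketch. The names here are illustrative (a scalar `double` stands in for `ResourceQuantities`, `std::map` for the stout containers); the point is that either a framework subscription or a reservation creates the role's entry, so a role with only reservations is no longer invisible:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Sketch of the consolidated Role map from the ticket: both framework
// ids and total reservations live in one entry per role.
struct Role
{
  std::set<std::string> frameworkIds;
  double totalReservedCpus = 0;  // stand-in for ResourceQuantities
};

class Roles
{
public:
  void subscribe(const std::string& role, const std::string& frameworkId)
  {
    roles[role].frameworkIds.insert(frameworkId);
  }

  void trackReservation(const std::string& role, double cpus)
  {
    roles[role].totalReservedCpus += cpus;  // creates the entry if absent
  }

  bool contains(const std::string& role) const
  {
    return roles.count(role) > 0;
  }

  std::map<std::string, Role> roles;
};
```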
[jira] [Commented] (MESOS-9688) Quota is not enforced properly when subroles have reservations.
[ https://issues.apache.org/jira/browse/MESOS-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810208#comment-16810208 ] Benjamin Mahler commented on MESOS-9688: Additional fix: https://reviews.apache.org/r/70393/ > Quota is not enforced properly when subroles have reservations. > --- > > Key: MESOS-9688 > URL: https://issues.apache.org/jira/browse/MESOS-9688 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.8.0 > > > Note: the discussion here concerns quota enforcement for top-level role, > setting quota on sublevel role is not supported. > If a subrole directly makes a reservation, the accounting of > `roleConsumedQuota` will be off: > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L1703-L1705 > Specifically, in this formula: > `Consumed Quota = reservations + allocation - allocated reservations` > The `reservations` part does not account subrole's reservation to its > ancestors. If a reservation is made directly for role "a/b", its reservation > is accounted only for "a/b" but not for "a". Similarly, if a top role ( "a") > reservation is refined to a subrole ("a/b"), the current code first subtracts > the reservation from "a" and then track that under "a/b". > We should make it hierarchical-aware. > The "allocation" and "allocated reservations" are both tracked in the sorter > where the hierarchical relationship is considered -- allocations are added > hierarchically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9691) Quota headroom calculation is off when subroles are involved.
[ https://issues.apache.org/jira/browse/MESOS-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809875#comment-16809875 ] Benjamin Mahler commented on MESOS-9691: Re-opening as there is an issue with the fix. > Quota headroom calculation is off when subroles are involved. > - > > Key: MESOS-9691 > URL: https://issues.apache.org/jira/browse/MESOS-9691 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.8.0 > > > Quota "availableHeadroom" calculation: > https://github.com/apache/mesos/blob/6276f7e73b0dbe7df49a7315cd1b83340d66f4ea/src/master/allocator/mesos/hierarchical.cpp#L1751-L1754 > is off when subroles are involved. > Specifically, in the formula > {noformat} > available headroom = total resources - allocated resources - (total > reservations - allocated reservations) - unallocated revocable resources > {noformat} > -The "allocated resources" part is hierarchical-aware and aggregate that > across all roles, thus allocations to subroles will be counted multiple times > (in the case of "a/b", once for "a" and once for "a/b").- Looks like due to > the presence of `INTERNAL` node, > `roleSorter->allocationScalarQuantities(role)` is *not* hierarchical. Thus > this is not an issue. > (If role `a/b` consumes 1cpu and `a` consumes 1cpu, if we query > `roleSorter->allocationScalarQuantities("a");` It will return 1cpu, which is > correct. In the sorter, there are four nodes, root, `a` (internal, 1cpu), > `a/.` (leaf, 1cpu), `a/b` (leaf, 1cpu). Query `a` will return `a/.`) > The "total reservations" is correct, since today it is "flat" (reservations > made to "a/b" are not counted to "a"). Thus all reservations are only counted > once -- which is the correct semantic here. 
However, once we fix MESOS-9688 > (which likely requires reservation tracking to be hierarchical-aware), we > need to ensure that the accounting is still correct. > -The "allocated reservations" is hierarchical-aware, thus overlap accounting > would occur.- Similar to the `"allocated resources"` above, this is also not > an issue at the moment. > Basically, when calculating the available headroom, we need to ensure > "single-counting". Ideally, we only need to look at the root's consumptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9688) Quota is not enforced properly when subroles have reservations.
[ https://issues.apache.org/jira/browse/MESOS-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809876#comment-16809876 ] Benjamin Mahler commented on MESOS-9688: Re-opening as there is an issue with the fix. > Quota is not enforced properly when subroles have reservations. > --- > > Key: MESOS-9688 > URL: https://issues.apache.org/jira/browse/MESOS-9688 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.8.0 > > > Note: the discussion here concerns quota enforcement for top-level role, > setting quota on sublevel role is not supported. > If a subrole directly makes a reservation, the accounting of > `roleConsumedQuota` will be off: > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L1703-L1705 > Specifically, in this formula: > `Consumed Quota = reservations + allocation - allocated reservations` > The `reservations` part does not account subrole's reservation to its > ancestors. If a reservation is made directly for role "a/b", its reservation > is accounted only for "a/b" but not for "a". Similarly, if a top role ( "a") > reservation is refined to a subrole ("a/b"), the current code first subtracts > the reservation from "a" and then track that under "a/b". > We should make it hierarchical-aware. > The "allocation" and "allocated reservations" are both tracked in the sorter > where the hierarchical relationship is considered -- allocations are added > hierarchically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9696) Test MasterQuotaTest.AvailableResourcesSingleDisconnectedAgent is flaky
[ https://issues.apache.org/jira/browse/MESOS-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9696: -- Assignee: Benjamin Mahler > Test MasterQuotaTest.AvailableResourcesSingleDisconnectedAgent is flaky > --- > > Key: MESOS-9696 > URL: https://issues.apache.org/jira/browse/MESOS-9696 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.8.0 >Reporter: Benjamin Bannier >Assignee: Benjamin Mahler >Priority: Major > Labels: flaky, flaky-test, resource-management > Attachments: test.log > > > The test {{MasterQuotaTest.AvailableResourcesSingleDisconnectedAgent}} is > flaky, especially under additional system load. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9691) Quota headroom calculation is off when subroles are involved.
[ https://issues.apache.org/jira/browse/MESOS-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9691: -- Assignee: Benjamin Mahler > Quota headroom calculation is off when subroles are involved. > - > > Key: MESOS-9691 > URL: https://issues.apache.org/jira/browse/MESOS-9691 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > > Quota "availableHeadroom" calculation: > https://github.com/apache/mesos/blob/6276f7e73b0dbe7df49a7315cd1b83340d66f4ea/src/master/allocator/mesos/hierarchical.cpp#L1751-L1754 > is off when subroles are involved. > Specifically, in the formula > {noformat} > available headroom = total resources - allocated resources - (total > reservations - allocated reservations) - unallocated revocable resources > {noformat} > The "allocated resources" part is hierarchical-aware and aggregate that > across all roles, thus allocations to subroles will be counted multiple times > (in the case of "a/b", once for "a" and once for "a/b"). > The "total reservations" is correct, since today it is "flat" (reservations > made to "a/b" are not counted to "a"). Thus all reservations are only counted > once -- which is the correct semantic here. However, once we fix MESOS-9688 > (which likely requires reservation tracking to be hierarchical-aware), we > need to ensure that the accounting is still correct. > The "allocated reservations" is hierarchical-aware, thus overlap accounting > would occur. > Basically, when calculating the available headroom, we need to ensure > "single-counting". Ideally, we only need to look at the root's consumptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
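The headroom formula quoted in the ticket is simple arithmetic once each term is counted exactly once; the ticket's concern is precisely that subrole allocations and reservations must not be double-counted across levels of the hierarchy. A scalar sketch of the formula itself, with cpus standing in for full resource quantities:

```cpp
#include <cassert>

// available headroom = total resources - allocated resources
//                      - (total reservations - allocated reservations)
//                      - unallocated revocable resources
// Each input must already be de-duplicated across the role hierarchy
// (i.e. "a/b" usage counted once, not once for "a/b" and once for "a").
double availableHeadroom(
    double total,
    double allocated,
    double totalReservations,
    double allocatedReservations,
    double unallocatedRevocable)
{
  return total - allocated -
         (totalReservations - allocatedReservations) -
         unallocatedRevocable;
}
```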
[jira] [Assigned] (MESOS-9688) Quota is not enforced properly when subroles have reservations.
[ https://issues.apache.org/jira/browse/MESOS-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9688: -- Assignee: Benjamin Mahler > Quota is not enforced properly when subroles have reservations. > --- > > Key: MESOS-9688 > URL: https://issues.apache.org/jira/browse/MESOS-9688 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > > Note: the discussion here concerns quota enforcement for top-level role, > setting quota on sublevel role is not supported. > If a subrole directly makes a reservation, the accounting of > `roleConsumedQuota` will be off: > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L1703-L1705 > Specifically, in this formula: > `Consumed Quota = reservations + allocation - allocated reservations` > The `reservations` part does not account subrole's reservation to its > ancestors. If a reservation is made directly for role "a/b", its reservation > is accounted only for "a/b" but not for "a". Similarly, if a top role ( "a") > reservation is refined to a subrole ("a/b"), the current code first subtracts > the reservation from "a" and then track that under "a/b". > We should make it hierarchical-aware. > The "allocation" and "allocated reservations" are both tracked in the sorter > where the hierarchical relationship is considered -- allocations are added > hierarchically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
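Making the `reservations` term hierarchy-aware, as the ticket suggests, means charging a reservation for "a/b" both to "a/b" and to every ancestor role ("a"), so top-level quota enforcement sees it. A hedged sketch of that accounting (illustrative code, not the allocator's real tracking structures):

```cpp
#include <cassert>
#include <map>
#include <string>

// Charge `cpus` of reservation to `role` and every ancestor, splitting
// the role path on '/'. E.g. "a/b" charges both "a" and "a/b".
void chargeReservation(
    std::map<std::string, double>& reservedCpus,
    const std::string& role,
    double cpus)
{
  for (size_t i = 0; i <= role.size(); ++i) {
    if (i == role.size() || role[i] == '/') {
      reservedCpus[role.substr(0, i)] += cpus;
    }
  }
}
```

The inverse (un-charging on unreserve, and moving charges on reservation refinement from "a" to "a/b") would need the same ancestor walk to keep the ledger consistent.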
[jira] [Created] (MESOS-9689) Migrate stout hashmap and hashset to Abseil's "swiss tables".
Benjamin Mahler created MESOS-9689: -- Summary: Migrate stout hashmap and hashset to Abseil's "swiss tables". Key: MESOS-9689 URL: https://issues.apache.org/jira/browse/MESOS-9689 Project: Mesos Issue Type: Improvement Components: stout Reporter: Benjamin Mahler For improved lookup and insertion performance, as well as lower memory consumption, we should migrate stout's hashmap / hashset wrappers to use Abseil's containers. There are some subtleties to migration, see: https://abseil.io/docs/cpp/guides/container See also: https://youtu.be/ncHmEUmJZf4 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9680) Remove automatic disablement of GLOG_drop_log_memory.
Benjamin Mahler created MESOS-9680: -- Summary: Remove automatic disablement of GLOG_drop_log_memory. Key: MESOS-9680 URL: https://issues.apache.org/jira/browse/MESOS-9680 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Once we upgrade to glog 0.4.0, we no longer need our special-case disablement of GLOG_drop_log_memory (see MESOS-920): https://github.com/apache/mesos/blob/1.7.2/src/logging/logging.cpp#L184-L194 This is because 0.4.0 includes https://github.com/google/glog/pull/145 which fixes the issue we filed: https://github.com/google/glog/issues/84. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8248) Expose information about GPU assigned to a task
[ https://issues.apache.org/jira/browse/MESOS-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798495#comment-16798495 ] Benjamin Mahler commented on MESOS-8248: [~jomach] also, let's use MESOS-5255 > Expose information about GPU assigned to a task > --- > > Key: MESOS-8248 > URL: https://issues.apache.org/jira/browse/MESOS-8248 > Project: Mesos > Issue Type: Improvement > Components: containerization, gpu >Reporter: Karthik Anantha Padmanabhan >Priority: Major > Labels: GPU > > As a framework author I'd like information about the GPU that was assigned to > a task. > `nvidia-smi`, for example, provides the following information: GPU UUID, board ID, > minor number, etc. It would be useful to expose this information when a task is > assigned to a GPU instance. > This will make it possible to monitor resource usage for a task on a GPU, which > is not possible when -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8248) Expose information about GPU assigned to a task
[ https://issues.apache.org/jira/browse/MESOS-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798494#comment-16798494 ] Benjamin Mahler commented on MESOS-8248: [~jomach] send an email to the dev@ mailing list with your proposal; feel free to also use the #containerizer slack channel. > Expose information about GPU assigned to a task > --- > > Key: MESOS-8248 > URL: https://issues.apache.org/jira/browse/MESOS-8248 > Project: Mesos > Issue Type: Improvement > Components: containerization, gpu >Reporter: Karthik Anantha Padmanabhan >Priority: Major > Labels: GPU > > As a framework author I'd like information about the GPU that was assigned to > a task. > `nvidia-smi`, for example, provides the following information: GPU UUID, board ID, > minor number, etc. It would be useful to expose this information when a task is > assigned to a GPU instance. > This will make it possible to monitor resource usage for a task on a GPU, which > is not possible when -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9292) Rejected quota request error messages should specify which resources were overcommitted.
[ https://issues.apache.org/jira/browse/MESOS-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9292: -- Assignee: Benjamin Mahler Sprint: Resource Mgmt RI12 Sp 42 > Rejected quota request error messages should specify which resources were > overcommitted. > - > > Key: MESOS-9292 > URL: https://issues.apache.org/jira/browse/MESOS-9292 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benno Evers >Assignee: Benjamin Mahler >Priority: Major > Labels: multitenancy > > If we reject a quota request due to not having enough available resources, we > fail with the following error: > {noformat} > Not enough available cluster capacity to reasonably satisfy quota > request; the force flag can be used to override this check > {noformat} > but we don't print *which* resource was not available. This can be confusing > to operators when quota was requested for multiple resources at > once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7883) Quota heuristic check not accounting for mount volumes
[ https://issues.apache.org/jira/browse/MESOS-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7883: -- Assignee: Benjamin Mahler Sprint: Resource Mgmt RI12 Sp 42 > Quota heuristic check not accounting for mount volumes > -- > > Key: MESOS-7883 > URL: https://issues.apache.org/jira/browse/MESOS-7883 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Vincent Roy >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > This may be expected but came as a surprise to us. We are unable to create a > quota bigger than the root disk space on slaves. > Given two clusters with the same number of slaves and root disk size, but one > that also has mount volumes, here is what the disk resources look like: > {noformat} > [root@fin-fang-foom-master-1 ~]# curl -s master.mesos:5050/state | jq > '.slaves[] .resources .disk' > 28698 > 28699 > 28698 > 28698 > 28697 > {noformat} > {noformat} > [root@hydra-master-1 ~]# curl -s master.mesos:5050/state | jq '.slaves[] > .resources .disk' > 50817 > 50817 > 50814 > 50819 > 50817 > {noformat} > In {{fin-fang-foom}}, I was able to create a quota for {{143490mb}}, which is > the total of available disk resources, root in this case, as reported by > Mesos. For {{hydra}}, I am only able to create a quota for {{143489mb}}. This > is equivalent to the total of root disks available in {{hydra}} rather than > the total available disks reported by Mesos resources, which is {{254084mb}}. > With a modified Mesos that adds logging to {{quota_handler}}, we can see that > only the {{disk(*)}} number increases in {{nonStaticClusterResources}} after > every iteration. The final iteration is {{disk(*):143489}}, which is the > maximum quota I was able to create on {{hydra}}. 
We expected that quota > heuristic check would also include resources such as > {{disk(*)[MOUNT:/dcos/volume2]:7373}} > {noformat} > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763764 > 24902 quota_handler.cpp:71] Performing capacity heuristic check for a set > quota request > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763783 > 24902 quota_handler.cpp:87] heuristic: total quota 'disk(*):143489' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763870 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763923 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763989 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764022 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):57396; cpus(*):8; mem(*):30046; disk(*)[MOUNT:/dcos/volume0]:7373; > 
disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764077 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28695; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764119 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):86091; cpus(*):12; mem(*):45069; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; >
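The surprising totals in the MESOS-7883 report can be sketched as follows. The matching rule and the resource representation here are assumptions for illustration, not Mesos's actual implementation: the idea is that a quota request is a plain `disk` scalar, while mount-volume disk carries source metadata such as `MOUNT:/dcos/volume0`, so a metadata-sensitive capacity check never counts it:

```python
# One hydra agent's disk resources as (name, source-metadata, amount),
# taken from the log output above (hypothetical representation).
agent = [
    ("disk", None, 28698),                  # root disk
    ("disk", "MOUNT:/dcos/volume0", 7373),  # mount volumes
    ("disk", "MOUNT:/dcos/volume1", 7373),
    ("disk", "MOUNT:/dcos/volume2", 7373),
]

# Metadata-sensitive total: only plain disk matches a plain quota resource,
# which reproduces the 28698-per-agent behavior the reporter observed.
strict_total = sum(v for name, src, v in agent if name == "disk" and src is None)
print(strict_total)  # 28698

# What the reporter expected: strip the disk source before totalling,
# matching the per-agent disk that /state reports.
flattened_total = sum(v for name, _, v in agent if name == "disk")
print(flattened_total)  # 50817
```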
[jira] [Commented] (MESOS-9634) Soft CPU limit for windows JobObject
[ https://issues.apache.org/jira/browse/MESOS-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792077#comment-16792077 ] Benjamin Mahler commented on MESOS-9634: Linked in a related ticket, I thought we had a ticket for "burstable containers" but I can't seem to find one. > Soft CPU limit for windows JobObject > > > Key: MESOS-9634 > URL: https://issues.apache.org/jira/browse/MESOS-9634 > Project: Mesos > Issue Type: Wish > Components: allocation, containerization >Reporter: Andrei Stryia >Priority: Major > > We are using Mesos to run Windows payloads. As I see it, CPU utilization on the > slave nodes is not very good. Because of the hard cap limit, a process cannot > use more CPU resources even if there are a lot of free CPU resources at the > moment (e.g. only one task is started on the node at the moment). > I know the reason for this behavior is the > {{JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP}} control flag of the Job Object. > But what about the ability to use the {{JOB_OBJECT_CPU_RATE_CONTROL_MIN_MAX_RATE}} > control flag, where MinRate would be the limit specified in the task config while > MaxRate would be 100% CPU? This option would work the same way as cgroups/cpu > and add more elasticity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9640) Add authorization support for `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9640: -- Assignee: Till Toenshoff > Add authorization support for `UPDATE_QUOTA` call. > -- > > Key: MESOS-9640 > URL: https://issues.apache.org/jira/browse/MESOS-9640 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Till Toenshoff >Priority: Major > Labels: mesosphere, resource-management > > For the new `UPDATE_QUOTA` call, we need to add the corresponding > authorization support. Unfortunately, there is already an action named > `update_quotas`. We can use `update_quota_configs` instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9640) Add authorization support for `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9640: -- Assignee: (was: Meng Zhu) > Add authorization support for `UPDATE_QUOTA` call. > -- > > Key: MESOS-9640 > URL: https://issues.apache.org/jira/browse/MESOS-9640 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Priority: Major > Labels: mesosphere, resource-management > > For the new `UPDATE_QUOTA` call, we need to add the corresponding > authorization support. Unfortunately, there is already an action named > `update_quotas`. We can use `update_quota_configs` instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9618) Display quota consumption in the webui.
Benjamin Mahler created MESOS-9618: -- Summary: Display quota consumption in the webui. Key: MESOS-9618 URL: https://issues.apache.org/jira/browse/MESOS-9618 Project: Mesos Issue Type: Improvement Components: webui Reporter: Benjamin Mahler Currently, the Roles table in the webui displays allocation and quota guarantees / limits. However, quota "consumption" is different from allocation, in that reserved resources are always considered consumed against the quota. This discrepancy has led to confusion from users. One example occurred when an agent was added with a large reservation exceeding the memory quota guarantee. The user saw memory chopping in offers, and since the scheduler didn't want to use the reservation, it couldn't launch its tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
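The consumption-vs-allocation distinction behind MESOS-9618 can be sketched with hypothetical numbers, using the consumed-quota formula from MESOS-9688 (`reservations + allocation - allocated reservations`):

```python
# Sketch: reserved resources count as consumed against a role's quota even
# when nothing is allocated from them, so a webui showing only allocation
# can look misleadingly far below the guarantee. (Hypothetical numbers.)

mem_guarantee = 64          # quota guarantee for the role (GB)
allocated = 10              # what the webui's allocation column shows
reservations = 60           # a large static reservation on one agent
allocated_reservations = 4  # reserved memory actually in use

# Consumed Quota = reservations + allocation - allocated reservations
consumed = reservations + allocated - allocated_reservations
print(consumed)  # 66 -- already over the 64 GB guarantee

# Allocation alone (10) suggests plenty of headroom; consumption (66)
# explains why the scheduler only sees chopped-up memory in offers.
```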
[jira] [Assigned] (MESOS-6840) Tests for quota capacity heuristic.
[ https://issues.apache.org/jira/browse/MESOS-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-6840: -- Assignee: (was: Zhitao Li) > Tests for quota capacity heuristic. > --- > > Key: MESOS-6840 > URL: https://issues.apache.org/jira/browse/MESOS-6840 > Project: Mesos > Issue Type: Task > Components: allocation, test >Reporter: Alexander Rukletsov >Priority: Major > Labels: mesosphere, quota, resource-management > > We need more tests to ensure capacity heuristic works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7883) Quota heuristic check not accounting for mount volumes
[ https://issues.apache.org/jira/browse/MESOS-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779689#comment-16779689 ] Benjamin Mahler commented on MESOS-7883: Linking in quota "capacity heuristic" testing work. > Quota heuristic check not accounting for mount volumes > -- > > Key: MESOS-7883 > URL: https://issues.apache.org/jira/browse/MESOS-7883 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Vincent Roy >Priority: Major > Labels: resource-management > > This may be expected but came as a surprise to us. We are unable to create a > quota bigger than the root disk space on slaves. > Given two clusters with the same number of slaves and root disk size, but one > that also has mount volumes, here is what the disk resources look like: > {noformat} > [root@fin-fang-foom-master-1 ~]# curl -s master.mesos:5050/state | jq > '.slaves[] .resources .disk' > 28698 > 28699 > 28698 > 28698 > 28697 > {noformat} > {noformat} > [root@hydra-master-1 ~]# curl -s master.mesos:5050/state | jq '.slaves[] > .resources .disk' > 50817 > 50817 > 50814 > 50819 > 50817 > {noformat} > In {{fin-fang-foom}}, I was able to create a quota for {{143490mb}}, which is > the total of available disk resources, root in this case, as reported by > Mesos. For {{hydra}}, I am only able to create a quota for {{143489mb}}. This > is equivalent to the total of root disks available in {{hydra}} rather than > the total available disks reported by Mesos resources, which is {{254084mb}}. > With a modified Mesos that adds logging to {{quota_handler}}, we can see that > only the {{disk(*)}} number increases in {{nonStaticClusterResources}} after > every iteration. The final iteration is {{disk(*):143489}}, which is the > maximum quota I was able to create on {{hydra}}. 
We expected that quota > heuristic check would also include resources such as > {{disk(*)[MOUNT:/dcos/volume2]:7373}} > {noformat} > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763764 > 24902 quota_handler.cpp:71] Performing capacity heuristic check for a set > quota request > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763783 > 24902 quota_handler.cpp:87] heuristic: total quota 'disk(*):143489' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763870 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763923 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763989 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764022 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):57396; cpus(*):8; mem(*):30046; disk(*)[MOUNT:/dcos/volume0]:7373; > 
disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764077 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28695; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764119 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):86091; cpus(*):12; mem(*):45069; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; >
[jira] [Assigned] (MESOS-6840) Tests for quota capacity heuristic.
[ https://issues.apache.org/jira/browse/MESOS-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-6840: -- Shepherd: (was: Alexander Rukletsov) Assignee: Benjamin Mahler Sprint: Resource Mgmt RI11 Sp 41 > Tests for quota capacity heuristic. > --- > > Key: MESOS-6840 > URL: https://issues.apache.org/jira/browse/MESOS-6840 > Project: Mesos > Issue Type: Task > Components: allocation, test >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, quota, resource-management > > We need more tests to ensure capacity heuristic works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6840) Tests for quota capacity heuristic.
[ https://issues.apache.org/jira/browse/MESOS-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779681#comment-16779681 ] Benjamin Mahler commented on MESOS-6840: As part of testing the capacity heuristic, we'd like to refactor the code to make it unit-testable. > Tests for quota capacity heuristic. > --- > > Key: MESOS-6840 > URL: https://issues.apache.org/jira/browse/MESOS-6840 > Project: Mesos > Issue Type: Task > Components: allocation, test >Reporter: Alexander Rukletsov >Priority: Major > Labels: mesosphere, quota, resource-management > > We need more tests to ensure capacity heuristic works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)