[jira] [Assigned] (MESOS-9669) Deprecate v0 quota calls.
[ https://issues.apache.org/jira/browse/MESOS-9669?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9669: -- Assignee: Benjamin Mahler > Deprecate v0 quota calls. > - > > Key: MESOS-9669 > URL: https://issues.apache.org/jira/browse/MESOS-9669 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, resource-management > > Once we introduce the new quota APIs in MESOS-8068, we should deprecate the > `/quota` endpoint. We should mark this as deprecated and hide it in our > documentation. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9937) 53598228fe should be backported to 1.7.x
[ https://issues.apache.org/jira/browse/MESOS-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9937: -- Assignee: Greg Mann Priority: Blocker (was: Major) Target Version/s: 1.7.3 Marking as a blocker for the next 1.7.x release. Greg please reassign if someone else can pick this up. > 53598228fe should be backported to 1.7.x > > > Key: MESOS-9937 > URL: https://issues.apache.org/jira/browse/MESOS-9937 > Project: Mesos > Issue Type: Bug >Reporter: longfei >Assignee: Greg Mann >Priority: Blocker > > Commit 53598228fe on the master branch should be backported to 1.7.x. > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16905316#comment-16905316 ] Benjamin Mahler commented on MESOS-9852: {quote} Do you mean max_*_tasks_per_framework? Would this history take hundreds of MBs? I'll try... {quote} Yes, for task history: {noformat} --max_completed_frameworks --max_completed_tasks_per_framework {noformat} {quote} I found that every terminated(no matter completed or unreachable) task would be put into slaves.unreachableTasks and would only be erased in _doRegistryGc. {quote} This will only happen for unreachable agents. Please file a ticket if you see otherwise. cc [~greggomann] [~vinodkone] At this point I don't see the leak described in this ticket in the memory profiling data, so we can continue the discussion on the mailing list or in slack, to avoid spamming the watchers of this ticket. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, > _tmp_libprocess.Do1MrG_profile (1).svg, _tmp_libprocess.Do1MrG_profile > 24hours.dump, _tmp_libprocess.Do1MrG_profile 24hours.svg, screenshot-1.png, > statistics > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. 
due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
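The fix pattern described in this ticket (keep a handle to the offer filter's timer so both the filter and the timer can be reclaimed when the filter is removed early, e.g. on revive) can be sketched in simplified form. This is an illustrative model only, not the actual libprocess-based allocator code; the type and member names are hypothetical.

```cpp
#include <map>
#include <memory>
#include <string>

// Illustrative stand-ins for the allocator's OfferFilter and
// process::Timer; names are hypothetical.
struct OfferFilter { std::string frameworkId; };
struct Timer { int id = 0; };

class FilterTracker {
public:
  void addFilter(const std::string& frameworkId) {
    // The fix: store the Timer handle next to the heap-allocated
    // filter, so both can be reclaimed before the timeout fires.
    Entry entry;
    entry.filter = std::make_unique<OfferFilter>(OfferFilter{frameworkId});
    entry.timer = Timer{nextTimerId_++};
    filters_[frameworkId] = std::move(entry);
  }

  // On REVIVE, cancel the timer and free the filter immediately,
  // instead of leaking both until the (possibly distant) expiration.
  void revive(const std::string& frameworkId) {
    filters_.erase(frameworkId);
  }

  size_t filterCount() const { return filters_.size(); }

private:
  struct Entry {
    std::unique_ptr<OfferFilter> filter;
    Timer timer;
  };

  std::map<std::string, Entry> filters_;
  int nextTimerId_ = 0;
};
```

Without the stored handle, the erase on revive could only drop the map entry, while the heap allocation and timer state would linger until the timer fired.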
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904850#comment-16904850 ] Benjamin Mahler commented on MESOS-9852: [~carlone] not sure if you intended to reply to my message but I noticed you attached the additional 24 hour data. Looking at it, it appears to be mostly due to task history. If you don't care about the task history, you can tune the master's flags to reduce the amount of framework / task history stored. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, > _tmp_libprocess.Do1MrG_profile (1).svg, _tmp_libprocess.Do1MrG_profile > 24hours.dump, _tmp_libprocess.Do1MrG_profile 24hours.svg, screenshot-1.png, > statistics > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9932) Removal of a role from the suppression list should be equivalent to REVIVE.
[ https://issues.apache.org/jira/browse/MESOS-9932?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9932: -- Assignee: Benjamin Mahler > Removal of a role from the suppression list should be equivalent to REVIVE. > --- > > Key: MESOS-9932 > URL: https://issues.apache.org/jira/browse/MESOS-9932 > Project: Mesos > Issue Type: Improvement > Components: allocation, scheduler api >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > [~timcharper] and [~daa] pointed out that removal of a role from the > suppression list (e.g. via UPDATE_FRAMEWORK) does not clear filters. This > means that schedulers have to issue a separate explicit REVIVE for the roles > they want to remove. > It seems like these are not the semantics we want, and we should instead be > clearing filters upon removing a role from the suppression list. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9932) Removal of a role from the suppression list should be equivalent to REVIVE.
Benjamin Mahler created MESOS-9932: -- Summary: Removal of a role from the suppression list should be equivalent to REVIVE. Key: MESOS-9932 URL: https://issues.apache.org/jira/browse/MESOS-9932 Project: Mesos Issue Type: Improvement Components: allocation, scheduler api Reporter: Benjamin Mahler [~timcharper] and [~daa] pointed out that removal of a role from the suppression list (e.g. via UPDATE_FRAMEWORK) does not clear filters. This means that schedulers have to issue a separate explicit REVIVE for the roles they want to remove. It seems like these are not the semantics we want, and we should instead be clearing filters upon removing a role from the suppression list. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16904005#comment-16904005 ] Benjamin Mahler commented on MESOS-9852: {quote} The newest commit is 8e8c6c0. {quote} [~carlone] this is what the /version endpoint shows? I don't see anything abnormal looking, just a combination of increasing number of connections, task history, and offer filters. But the sample you took is only looking at 35 MB of memory growth. Can you run this over the course of a very long time period to try to capture a large amount of the memory increase? E.g. 12 hours - 72 hours? Be sure to show the same graph as before so we know what the memory consumption history looked like. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: _tmp_libprocess.Do1MrG_profile (1).dump, > _tmp_libprocess.Do1MrG_profile (1).svg, screenshot-1.png, statistics > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. 
due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16901248#comment-16901248 ] Benjamin Mahler commented on MESOS-9852: [~carlone] we can figure out whether it has this fix if we have the commit sha. You can check this by hitting the /version endpoint on the master. In any case, please include the memory profiling data as well. > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: screenshot-1.png > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9852) Slow memory growth in master due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16899214#comment-16899214 ] Benjamin Mahler commented on MESOS-9852: Hi [~carlone], 1.7.3 is not released yet, are you referring to the 1.7.x release branch with the fix in this ticket applied? Please report your findings using the built in memory profiling: http://mesos.apache.org/documentation/latest/memory-profiling/ > Slow memory growth in master due to deferred deletion of offer filters and > timers. > -- > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > Attachments: screenshot-1.png > > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-8069) Role-related endpoints need to reflect hierarchical accounting.
[ https://issues.apache.org/jira/browse/MESOS-8069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897302#comment-16897302 ] Benjamin Mahler commented on MESOS-8069: This was done for the v0 /roles endpoint but still needs to be done for v1 GET_ROLES. > Role-related endpoints need to reflect hierarchical accounting. > --- > > Key: MESOS-8069 > URL: https://issues.apache.org/jira/browse/MESOS-8069 > Project: Mesos > Issue Type: Bug > Components: agent, HTTP API, master >Reporter: Benjamin Mahler >Assignee: Till Toenshoff >Priority: Major > Labels: mesosphere, multitenancy, resource-management > Attachments: Screen Shot 2018-03-06 at 15.06.04.png > > > With the introduction of hierarchical roles, the role-related endpoints need > to be updated to provide aggregated accounting information. > For example, information about how many resources are allocated to "/eng" > should include the resources allocated to "/eng/frontend" and "/eng/backend", > since quota guarantees and limits are also applied on the aggregation. > This also affects the UI display, for example the 'Roles' tab. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9427) Revisit quota documentation.
[ https://issues.apache.org/jira/browse/MESOS-9427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9427: -- Assignee: Benjamin Mahler > Revisit quota documentation. > > > Key: MESOS-9427 > URL: https://issues.apache.org/jira/browse/MESOS-9427 > Project: Mesos > Issue Type: Documentation > Components: allocation, documentation >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: multitenancy, resource-management > > At this point the quota documentation in the docs/ folder has become rather > stale. It would be good to at least update any inaccuracies and ideally > re-write it to better reflect the current thinking. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9758) Take ports out of the roles endpoints.
[ https://issues.apache.org/jira/browse/MESOS-9758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16897300#comment-16897300 ] Benjamin Mahler commented on MESOS-9758: v0 /roles no longer has ports, but v1 GET_ROLES still has it. > Take ports out of the roles endpoints. > -- > > Key: MESOS-9758 > URL: https://issues.apache.org/jira/browse/MESOS-9758 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Priority: Major > Labels: resource-management > > It does not make sense to combine ports across agents. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-6200) Hope mesos support soft and hard cpu/memory resource in the task
[ https://issues.apache.org/jira/browse/MESOS-6200?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16896580#comment-16896580 ] Benjamin Mahler commented on MESOS-6200: [~xds2000] I think this request is about minimum / maximum container cpu / memory and I don't think that rlimits is the way to accomplish that. We will be working on it via MESOS-9916.
> Hope mesos support soft and hard cpu/memory resource in the task
> ----------------------------------------------------------------
>
> Key: MESOS-6200
> URL: https://issues.apache.org/jira/browse/MESOS-6200
> Project: Mesos
> Issue Type: Improvement
> Components: containerization, docker, scheduler api
> Affects Versions: 0.28.2
> Environment: CentOS 7
> Kernel 3.10.0-327.28.3.el7.x86_64
> Mesos 0.28.2
> Docker 1.11.2
> Reporter: Lei Xu
> Priority: Major
> Labels: resource-management
>
> The Docker executor could support soft/hard resource limits to enable more flexible resource sharing among applications.
> || || CPU || Memory ||
> | hard limit | --cpu-period & --cpu-quota | --memory & --memory-swap |
> | soft limit | --cpu-shares | --memory-reservation |
> The task protobuf message currently has only one resource struct, used to describe the cgroup limit, and the Docker executor handles it as follows; only --memory and --cpu-shares are set:
> {code}
> if (resources.isSome()) {
>   // TODO(yifan): Support other resources (e.g. disk).
>   Option<double> cpus = resources.get().cpus();
>   if (cpus.isSome()) {
>     uint64_t cpuShare =
>       std::max((uint64_t) (CPU_SHARES_PER_CPU * cpus.get()), MIN_CPU_SHARES);
>     argv.push_back("--cpu-shares");
>     argv.push_back(stringify(cpuShare));
>   }
>
>   Option<Bytes> mem = resources.get().mem();
>   if (mem.isSome()) {
>     Bytes memLimit = std::max(mem.get(), MIN_MEMORY);
>     argv.push_back("--memory");
>     argv.push_back(stringify(memLimit.bytes()));
>   }
> }
> {code}
> I hope the executor and the protobuf message could separate resources into two parts, soft and hard, so that users could set two levels of resource limits for Docker containers.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
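One way the requested soft/hard split could look when assembling docker run arguments. This is a hypothetical sketch, not the actual Mesos executor code; the constants and the cpuArgs helper are illustrative, with --cpu-shares as the soft limit and a CFS period/quota pair as the hard limit.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>
#include <vector>

// Illustrative constants mirroring common cgroup conventions.
constexpr uint64_t CPU_SHARES_PER_CPU = 1024;
constexpr uint64_t MIN_CPU_SHARES = 2;
constexpr uint64_t CPU_CFS_PERIOD_US = 100000;

std::vector<std::string> cpuArgs(double softCpus, double hardCpus) {
  std::vector<std::string> argv;

  // Soft limit: relative weight under contention (--cpu-shares).
  uint64_t shares = std::max(
      static_cast<uint64_t>(CPU_SHARES_PER_CPU * softCpus), MIN_CPU_SHARES);
  argv.push_back("--cpu-shares");
  argv.push_back(std::to_string(shares));

  // Hard limit: CFS quota capping absolute usage
  // (--cpu-period / --cpu-quota).
  argv.push_back("--cpu-period");
  argv.push_back(std::to_string(CPU_CFS_PERIOD_US));
  argv.push_back("--cpu-quota");
  argv.push_back(std::to_string(
      static_cast<uint64_t>(CPU_CFS_PERIOD_US * hardCpus)));

  return argv;
}
```

For example, cpuArgs(1.0, 2.0) yields a 1024-share soft weight and a quota allowing up to two CPUs of hard usage.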
[jira] [Created] (MESOS-9916) Support per-container cpu / memory bursting.
Benjamin Mahler created MESOS-9916: -- Summary: Support per-container cpu / memory bursting. Key: MESOS-9916 URL: https://issues.apache.org/jira/browse/MESOS-9916 Project: Mesos Issue Type: Epic Components: containerization, scheduler api Reporter: Benjamin Mahler Currently, the cgroup cpu policy is burned in at the agent level. The user can start the agent with {{--cgroups_enable_cfs}} to apply cfs quota to all containers (effectively disallowing exceeding the requested amount of cpus for all containers on the agent). The agent does not allow containers to exceed the requested memory (except when a container's requested memory is shrunk). We should instead enable per-container cpu / memory bursting via per-container cpu and memory requests / limits. See kubernetes for an example of a per container cpu/memory bursting API: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/#resource-requests-and-limits-of-pod-and-container -- This message was sent by Atlassian JIRA (v7.6.14#76016)
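The requests/limits model referenced above maps naturally onto cgroup knobs. A minimal sketch under the assumption of a kubernetes-style API; the struct and field names are hypothetical, and cgroup v1 memory controls are shown for concreteness.

```cpp
#include <cstdint>

// Hypothetical request/limit pair for one container: the request is
// the guaranteed amount, the limit is the burst ceiling.
struct MemorySpec {
  uint64_t requestBytes;
  uint64_t limitBytes;
};

// Possible mapping onto cgroup v1 memory controls: the soft limit
// reflects the guarantee, the hard limit caps bursting.
struct CgroupMemory {
  uint64_t softLimitInBytes;  // memory.soft_limit_in_bytes
  uint64_t limitInBytes;      // memory.limit_in_bytes
};

CgroupMemory toCgroupMemory(const MemorySpec& spec) {
  return CgroupMemory{spec.requestBytes, spec.limitBytes};
}
```

A container with request < limit can then burst above its guarantee when the agent has spare memory, instead of being capped at the requested amount.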
[jira] [Created] (MESOS-9915) Store a role tree in the master.
Benjamin Mahler created MESOS-9915: -- Summary: Store a role tree in the master. Key: MESOS-9915 URL: https://issues.apache.org/jira/browse/MESOS-9915 Project: Mesos Issue Type: Improvement Components: master Reporter: Benjamin Mahler Currently, both the master and allocator track known roles in maps (note however that the master does not currently have complete tracking of known roles). These Role structs track some information about roles, but currently do not track information hierarchically. As a result, when per-role resource quantities were exposed in the API, we had to add code outside of the master's Role struct to perform the hierarchical aggregation. It would be nice if the master (and allocator) had a complete Role tree stored and updated in an event driven manner to obtain information cheaply at any point in time. Ideally this role tree abstraction can be shared (e.g. with the allocator) which may not be trivial since the information tracked might differ. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9861) Make PushGauges support floating point stats.
[ https://issues.apache.org/jira/browse/MESOS-9861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9861: -- Assignee: Benjamin Mahler
> Make PushGauges support floating point stats.
> ---------------------------------------------
>
> Key: MESOS-9861
> URL: https://issues.apache.org/jira/browse/MESOS-9861
> Project: Mesos
> Issue Type: Bug
> Components: metrics
> Reporter: Meng Zhu
> Assignee: Benjamin Mahler
> Priority: Major
> Labels: foundations, resource-management
>
> Currently, PushGauges are modeled after counters, so they do not support floating point stats. This prevents many existing PullGauges from using them. We need to add support for floating point stats.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9912) Webui roles table sorting treats 0 entries as largest values.
Benjamin Mahler created MESOS-9912: -- Summary: Webui roles table sorting treats 0 entries as largest values. Key: MESOS-9912 URL: https://issues.apache.org/jira/browse/MESOS-9912 Project: Mesos Issue Type: Bug Components: webui Reporter: Benjamin Mahler Currently, the webui roles table displays dashes ("-") for zero entries to ease readability of non-zero entries; however, this alters the column sorting behavior to treat these entries as larger than any number. The expected behavior is for "-" entries to be treated as zero. Ideally we can fix this without sticking zeroes everywhere and reducing the readability of the table. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9603) Add quota limits metrics.
[ https://issues.apache.org/jira/browse/MESOS-9603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9603: -- Assignee: Benjamin Mahler > Add quota limits metrics. > - > > Key: MESOS-9603 > URL: https://issues.apache.org/jira/browse/MESOS-9603 > Project: Mesos > Issue Type: Task >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9901) Specialize jsonify for protobuf Maps.
[ https://issues.apache.org/jira/browse/MESOS-9901?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16891171#comment-16891171 ] Benjamin Mahler commented on MESOS-9901: [~bbannier] hm.. not sure how the existing format was produced but it doesn't comply with the standard mapping? https://developers.google.com/protocol-buffers/docs/proto3#json I think we should just bite the bullet and send out an email to make the breaking change to get towards the proto3 standard json mapping.
> Specialize jsonify for protobuf Maps.
> -------------------------------------
>
> Key: MESOS-9901
> URL: https://issues.apache.org/jira/browse/MESOS-9901
> Project: Mesos
> Issue Type: Improvement
> Components: json api
> Reporter: Meng Zhu
> Priority: Major
>
> jsonify currently treats protobuf maps as regular repeated fields. For example, for the schema
> {noformat}
> message QuotaConfig {
>   required string role = 1;
>   map<string, Value.Scalar> guarantees = 2;
>   map<string, Value.Scalar> limits = 3;
> }
> {noformat}
> it will produce:
> {noformat}
> "configs": [
>   {
>     "role": "role1",
>     "guarantees": [
>       {
>         "key": "cpus",
>         "value": {
>           "value": 1
>         }
>       },
>       {
>         "key": "mem",
>         "value": {
>           "value": 512
>         }
>       }
>     ]
> {noformat}
> This output cannot be parsed back into the proto messages. We need to specialize jsonify for map types.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
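For contrast, the proto3 standard JSON mapping renders a map field as a JSON object keyed by the map keys, rather than as an array of key/value entries. A small sketch of that target shape; the hand-rolled serialization below is for illustration only, not the proposed jsonify specialization.

```cpp
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// Renders map entries in the proto3 JSON object form, e.g.
// {"cpus":{"value":1},"mem":{"value":512}}, instead of the
// repeated [{"key":...,"value":...}] form.
std::string toMapJson(
    const std::vector<std::pair<std::string, double>>& entries) {
  std::ostringstream out;
  out << "{";
  bool first = true;
  for (const auto& [key, value] : entries) {
    if (!first) out << ",";
    first = false;
    out << "\"" << key << "\":{\"value\":" << value << "}";
  }
  out << "}";
  return out.str();
}
```

The object form round-trips: standard proto3 JSON parsers can read it back into the map field directly.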
[jira] [Created] (MESOS-9897) Remove java and python language bindings from the source tree.
Benjamin Mahler created MESOS-9897: -- Summary: Remove java and python language bindings from the source tree. Key: MESOS-9897 URL: https://issues.apache.org/jira/browse/MESOS-9897 Project: Mesos Issue Type: Task Reporter: Benjamin Mahler The java and python bindings are not well maintained, and now that we have the HTTP-based v1 scheduler and executor APIs it would be good to remove the burden of carrying them. I've targeted this for the 2.0 milestone so that we remember to do it, since this is a breaking change. If there are no objections from users, we could find a way to remove them prior to 2.0. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9896) Consider using protobuf provided json conversion facilities rather than custom ones.
Benjamin Mahler created MESOS-9896: -- Summary: Consider using protobuf provided json conversion facilities rather than custom ones. Key: MESOS-9896 URL: https://issues.apache.org/jira/browse/MESOS-9896 Project: Mesos Issue Type: Task Components: stout Reporter: Benjamin Mahler Currently, stout provides custom JSON to protobuf conversion facilities, some of which use protobuf reflection. When upgrading protobuf to 3.7.x in MESOS-9755, we found that the v0 /state response of the master slowed down, and it appears to be due to a performance regression in the protobuf reflection code. We should file an issue with protobuf, but we should also look into using the json conversion code that protobuf provides to see if that can help avoid the regression. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
[ https://issues.apache.org/jira/browse/MESOS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885568#comment-16885568 ] Benjamin Mahler edited comment on MESOS-9890 at 7/15/19 9:13 PM: - https://reviews.apache.org/r/71073/ https://reviews.apache.org/r/71077/ was (Author: bmahler): https://reviews.apache.org/r/71073/ (no test yet) > /roles and GET_ROLES does not always expose parent roles. > - > > Key: MESOS-9890 > URL: https://issues.apache.org/jira/browse/MESOS-9890 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If some descendant roles are present in frameworks, then the parent roles > will not be exposed in the /roles and GET_ROLES endpoints. > This is because the tracking is currently based on frameworks being > subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
[ https://issues.apache.org/jira/browse/MESOS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16885568#comment-16885568 ] Benjamin Mahler commented on MESOS-9890: https://reviews.apache.org/r/71073/ (no test yet) > /roles and GET_ROLES does not always expose parent roles. > - > > Key: MESOS-9890 > URL: https://issues.apache.org/jira/browse/MESOS-9890 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If some descendant roles are present in frameworks, then the parent roles > will not be exposed in the /roles and GET_ROLES endpoints. > This is because the tracking is currently based on frameworks being > subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9888) /roles and GET_ROLES do not expose roles with only static reservations
[ https://issues.apache.org/jira/browse/MESOS-9888?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9888: -- Assignee: Benjamin Mahler > /roles and GET_ROLES do not expose roles with only static reservations > -- > > Key: MESOS-9888 > URL: https://issues.apache.org/jira/browse/MESOS-9888 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If a role is only known to the master because of an agent with static > reservations to that role, it will not be shown in the /roles and GET_ROLES > APIs. > This is because the roles are tracked based on frameworks primarily. We'll > need to update the tracking to include when there are agents with > reservations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Assigned] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
[ https://issues.apache.org/jira/browse/MESOS-9890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9890: -- Assignee: Benjamin Mahler > /roles and GET_ROLES does not always expose parent roles. > - > > Key: MESOS-9890 > URL: https://issues.apache.org/jira/browse/MESOS-9890 > Project: Mesos > Issue Type: Bug >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > If some descendant roles are present in frameworks, then the parent roles > will not be exposed in the /roles and GET_ROLES endpoints. > This is because the tracking is currently based on frameworks being > subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9890) /roles and GET_ROLES does not always expose parent roles.
Benjamin Mahler created MESOS-9890: -- Summary: /roles and GET_ROLES does not always expose parent roles. Key: MESOS-9890 URL: https://issues.apache.org/jira/browse/MESOS-9890 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler If some descendant roles are present in frameworks, then the parent roles will not be exposed in the /roles and GET_ROLES endpoints. This is because the tracking is currently based on frameworks being subscribed to the role. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884048#comment-16884048 ] Benjamin Mahler commented on MESOS-5037: [~haosd...@gmail.com] Can you file a separate ticket for the performance problem? And we can keep this ticket as a foreachkey issue? > foreachkey behaviour is not expected in multimap > > > Key: MESOS-5037 > URL: https://issues.apache.org/jira/browse/MESOS-5037 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: haosdent >Priority: Major > Labels: foundations, stout > > Currently the {{foreachkey}} implementation is > {code} > #define foreachkey(VAR, COL)\ > foreachpair (VAR, __foreach__::ignore, COL) > {code} > This works in most structures. But in multimap, one key may map to multi > values. This means there are multi pairs which have same key. So when call > {{foreachkey}}, the {{key}} would duplicated when iteration. My idea to solve > this is we prefer call {{foreach}} on {{(COL).keys()}} if {{keys()}} method > exists in {{COL}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Comment Edited] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884044#comment-16884044 ] Benjamin Mahler edited comment on MESOS-5037 at 7/12/19 5:59 PM:
[~bmahler] Sure, it is https://github.com/apache/mesos/blob/9932550e9632e7fbb9a45b217793c7f508f57001/src/master/master.cpp#L7707-L7708
{code}
void Master::__reregisterSlave(
...
  foreachkey (FrameworkID frameworkId, slaves.unreachableTasks.at(slaveInfo.id())) {
  ...
    foreach (TaskID taskId, slaves.unreachableTasks.at(slaveInfo.id()).get(frameworkId)) {
{code}
Our case: when the network flaps and 3~4 agents reregister, the master saturates its CPU and cannot process any requests during that period.

was (Author: haosd...@gmail.com):
[~bmahler] Sure, it is https://github.com/apache/mesos/blob/master/src/master/master.cpp#L7707-L7708
{code}
void Master::__reregisterSlave(
...
  foreachkey (FrameworkID frameworkId, slaves.unreachableTasks.at(slaveInfo.id())) {
  ...
    foreach (TaskID taskId, slaves.unreachableTasks.at(slaveInfo.id()).get(frameworkId)) {
{code}
Our case: when the network flaps and 3~4 agents reregister, the master saturates its CPU and cannot process any requests during that period.

> foreachkey behaviour is not expected in multimap
> ------------------------------------------------
>
> Key: MESOS-5037
> URL: https://issues.apache.org/jira/browse/MESOS-5037
> Project: Mesos
> Issue Type: Bug
> Components: stout
> Reporter: haosdent
> Priority: Major
> Labels: foundations, stout
>
> Currently the {{foreachkey}} implementation is
> {code}
> #define foreachkey(VAR, COL) \
>   foreachpair (VAR, __foreach__::ignore, COL)
> {code}
> This works for most structures, but in a multimap one key may map to multiple values, so there are multiple pairs with the same key. As a result, {{foreachkey}} visits duplicate keys during iteration. My idea to solve this: prefer calling {{foreach}} on {{(COL).keys()}} if a {{keys()}} method exists in {{COL}}.
-- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16884011#comment-16884011 ] Benjamin Mahler commented on MESOS-5037: [~haosd...@gmail.com] can you post a link to the code in question? > foreachkey behaviour is not expected in multimap > > > Key: MESOS-5037 > URL: https://issues.apache.org/jira/browse/MESOS-5037 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: haosdent >Priority: Major > Labels: foundations, stout > > Currently the {{foreachkey}} implementation is > {code} > #define foreachkey(VAR, COL)\ > foreachpair (VAR, __foreach__::ignore, COL) > {code} > This works for most structures. But in a multimap, one key may map to multiple > values, so there are multiple pairs with the same key. When calling > {{foreachkey}}, the key is therefore visited once per value during iteration. My suggested fix is > to prefer calling {{foreach}} on {{(COL).keys()}} if a {{keys()}} method > exists in {{COL}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Commented] (MESOS-5037) foreachkey behaviour is not expected in multimap
[ https://issues.apache.org/jira/browse/MESOS-5037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883340#comment-16883340 ] Benjamin Mahler commented on MESOS-5037: [~haosd...@gmail.com] foreachkey indeed sounds problematic for multimap. I didn't follow the CPU load issue you found. Can you file a related ticket explaining it? Be sure to show the code in question that is inducing the CPU load, and attach perf data if possible. > foreachkey behaviour is not expected in multimap > > > Key: MESOS-5037 > URL: https://issues.apache.org/jira/browse/MESOS-5037 > Project: Mesos > Issue Type: Bug > Components: stout >Reporter: haosdent >Priority: Major > Labels: foundations, stout > > Currently the {{foreachkey}} implementation is > {code} > #define foreachkey(VAR, COL)\ > foreachpair (VAR, __foreach__::ignore, COL) > {code} > This works for most structures. But in a multimap, one key may map to multiple > values, so there are multiple pairs with the same key. When calling > {{foreachkey}}, the key is therefore visited once per value during iteration. My suggested fix is > to prefer calling {{foreach}} on {{(COL).keys()}} if a {{keys()}} method > exists in {{COL}}. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
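The duplicate-key behaviour described in this ticket is easy to reproduce outside of stout. Below is a small self-contained C++ sketch (the helper names are ours, not stout's) contrasting pair-based iteration — which, like the foreachpair-based {{foreachkey}} expansion, visits a key once per value — with iteration over distinct keys, which is the shape of the proposed {{keys()}}-based fix:

```cpp
#include <map>
#include <string>
#include <vector>

// Mimics what the foreachkey/foreachpair expansion does on a multimap:
// iterating pairs and taking only the key visits a key once per value.
std::vector<std::string> keysViaPairs(const std::multimap<std::string, int>& m)
{
  std::vector<std::string> keys;
  for (const auto& pair : m) {
    keys.push_back(pair.first); // A key appears once per value mapped to it.
  }
  return keys;
}

// The shape of the proposed fix: iterate distinct keys only (what a
// keys() method on the container would return).
std::vector<std::string> distinctKeys(const std::multimap<std::string, int>& m)
{
  std::vector<std::string> keys;
  for (auto it = m.begin(); it != m.end(); it = m.upper_bound(it->first)) {
    keys.push_back(it->first);
  }
  return keys;
}
```

With {"a" -> 1, "a" -> 2, "b" -> 3}, the first helper yields three keys while the second yields the two distinct keys.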
[jira] [Commented] (MESOS-8789) Role-related endpoints should display distinct offered and allocated resources.
[ https://issues.apache.org/jira/browse/MESOS-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16883290#comment-16883290 ] Benjamin Mahler commented on MESOS-8789: {noformat} commit d6738bcc86525e1ac661d2027a1934134426255f Author: Benjamin Mahler Date: Wed Jul 10 19:36:54 2019 -0400 Added Role::reserved, Role::allocated, Role::offered to master. This provides a breakdown of resource quantities on a per-role basis, that would aid debugging if shown in the endpoints and roles table in the ui. Review: https://reviews.apache.org/r/71050 {noformat} {noformat} commit 69c8feab6a62b1728872a367a8ed28f88eb029d3 (HEAD -> master, apache/master) Author: Benjamin Mahler Date: Wed Jul 10 20:09:31 2019 -0400 Added reserved, offered, allocated resources to the /roles endpoint. This provides helpful information for debugging, as well as for the webui to display in the roles table. Review: https://reviews.apache.org/r/71053 {noformat} > Role-related endpoints should display distinct offered and allocated > resources. > --- > > Key: MESOS-8789 > URL: https://issues.apache.org/jira/browse/MESOS-8789 > Project: Mesos > Issue Type: Improvement > Components: agent, HTTP API, master >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, multitenancy, resource-management > > The role endpoints currently show accumulated values for resources > (allocated), containing offered resources. For gaining an overview showing > our allocated resources separately from the offered resources could improve > the signal quality, depending on the use case. > This also affects the UI display, for example the "Roles" tab. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
[jira] [Created] (MESOS-9888) /roles and GET_ROLES do not expose roles with only static reservations
Benjamin Mahler created MESOS-9888: -- Summary: /roles and GET_ROLES do not expose roles with only static reservations Key: MESOS-9888 URL: https://issues.apache.org/jira/browse/MESOS-9888 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler If a role is only known to the master because an agent has static reservations for that role, it will not be shown in the /roles and GET_ROLES APIs. This is because roles are tracked primarily based on frameworks. We'll need to update the tracking to also include roles for which agents hold reservations. -- This message was sent by Atlassian JIRA (v7.6.14#76016)
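The fix described above amounts to a set union. A minimal sketch (the helper name and inputs are illustrative, not the master's actual bookkeeping): the roles the endpoints should expose are the union of roles referenced by frameworks and roles holding reservations on some agent:

```cpp
#include <set>
#include <string>

// Hypothetical helper: the roles the master should expose are the union
// of framework roles and roles with (e.g. static) reservations on agents.
std::set<std::string> knownRoles(
    const std::set<std::string>& frameworkRoles,
    const std::set<std::string>& reservationRoles)
{
  std::set<std::string> roles = frameworkRoles;
  roles.insert(reservationRoles.begin(), reservationRoles.end());
  return roles;
}
```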
[jira] [Assigned] (MESOS-8503) Improve UI when displaying frameworks with many roles.
[ https://issues.apache.org/jira/browse/MESOS-8503?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8503: -- Assignee: (was: Armand Grillet) > Improve UI when displaying frameworks with many roles. > -- > > Key: MESOS-8503 > URL: https://issues.apache.org/jira/browse/MESOS-8503 > Project: Mesos > Issue Type: Task >Reporter: Armand Grillet >Priority: Major > Attachments: Screen Shot 2018-01-29 à 10.38.05.png > > > The /frameworks UI endpoint displays all the roles of each framework in a > table: > !Screen Shot 2018-01-29 à 10.38.05.png! > This is not readable if a framework has many roles. We thus need to provide a > solution to only display a few roles per framework and show more when a user > wants to see all of them. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9618) Display quota consumption in the webui.
[ https://issues.apache.org/jira/browse/MESOS-9618?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9618: -- Assignee: Benjamin Mahler > Display quota consumption in the webui. > --- > > Key: MESOS-9618 > URL: https://issues.apache.org/jira/browse/MESOS-9618 > Project: Mesos > Issue Type: Improvement > Components: webui >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > Currently, the Roles table in the webui displays allocation and quota > guarantees / limits. However, quota "consumption" is different from > allocation, in that reserved resources are always considered consumed against > the quota. > This discrepancy has led to confusion from users. One example occurred when > an agent was added with a large reservation exceeding the memory quota > guarantee. The user sees memory chopping in offers, and since the scheduler > didn't want to use the reservation, it can't launch its tasks. > If consumption is shown in the UI, we should include a tooltip that > indicates how consumption is calculated so that users know how to interpret it. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
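To make the allocation-vs-consumption distinction above concrete, here is a minimal sketch using our own names and a single scalar resource (not Mesos's internal representation), assuming reserved-but-unallocated resources are charged against quota as the ticket describes:

```cpp
// Illustrative quantities for one role and one scalar resource (e.g.
// memory in MB); the names are ours, not Mesos's internal representation.
struct RoleResources
{
  double allocated;            // Currently allocated to the role.
  double unallocatedReserved;  // Reserved for the role but sitting idle.
};

// Allocation alone undercounts: reservations are considered consumed
// against the quota even when nothing is running on them. (Allocated
// reservations are already included in `allocated`, so only the
// unallocated portion is added to avoid double counting.)
double quotaConsumed(const RoleResources& r)
{
  return r.allocated + r.unallocatedReserved;
}
```

In the scenario the ticket describes, a role with little allocation but a large reservation would show low allocation yet high consumption, explaining why offers are chopped against the quota guarantee.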
[jira] [Created] (MESOS-9886) RoleTest.RolesEndpointContainsConsumedQuota is flaky.
Benjamin Mahler created MESOS-9886: -- Summary: RoleTest.RolesEndpointContainsConsumedQuota is flaky. Key: MESOS-9886 URL: https://issues.apache.org/jira/browse/MESOS-9886 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Benjamin Mahler {noformat} [ RUN ] RoleTest.RolesEndpointContainsConsumedQuota I0710 07:05:42.670790 9995 cluster.cpp:176] Creating default 'local' authorizer I0710 07:05:42.672238 master.cpp:440] Master 8db40cec-43ef-41a1-89a4-4f7b877d8f13 (ip-172-16-10-69.ec2.internal) started on 172.16.10.69:37082 I0710 07:05:42.672256 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregiste r_timeout="10mins" --allocation_interval="1secs" --allocator="hierarchical" --authenticate_agents="true" --authenticate _frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwr ite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/1d 0m6o/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http _authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initializ e="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_co mpleted_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework ="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework _metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout=" 1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry _store_timeout="100secs" --registry_strict="false" --require_agent_domain="false" --role_sorter="drf" --root_submission s="true" 
--version="false" --webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/1d0m6o/master" --zk_session_time out="10secs" I0710 07:05:42.672351 master.cpp:492] Master only allowing authenticated frameworks to register I0710 07:05:42.672356 master.cpp:498] Master only allowing authenticated agents to register I0710 07:05:42.672360 master.cpp:504] Master only allowing authenticated HTTP frameworks to register I0710 07:05:42.672364 credentials.hpp:37] Loading credentials for authentication from '/tmp/1d0m6o/credentials' I0710 07:05:42.672430 master.cpp:548] Using default 'crammd5' authenticator I0710 07:05:42.672466 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0710 07:05:42.672508 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite ' I0710 07:05:42.672538 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler ' I0710 07:05:42.672569 master.cpp:629] Authorization enabled I0710 07:05:42.672658 10001 hierarchical.cpp:241] Initialized hierarchical allocator process I0710 07:05:42.672685 10001 whitelist_watcher.cpp:77] No whitelist given I0710 07:05:42.673316 10001 master.cpp:2150] Elected as the leading master! 
I0710 07:05:42.673331 10001 master.cpp:1664] Recovering from registrar I0710 07:05:42.673616 10001 registrar.cpp:339] Recovering registrar I0710 07:05:42.673874 10001 registrar.cpp:383] Successfully fetched the registry (0B) in 239104ns I0710 07:05:42.673923 10001 registrar.cpp:487] Applied 1 operations in 7745ns; attempting to update the registry I0710 07:05:42.674052 registrar.cpp:544] Successfully updated the registry in 108032ns I0710 07:05:42.674082 registrar.cpp:416] Successfully recovered registrar I0710 07:05:42.674152 master.cpp:1799] Recovered 0 agents from the registry (180B); allowing 10mins for agents to reregister I0710 07:05:42.674185 9996 hierarchical.cpp:280] Skipping recovery of hierarchical allocator: nothing to recover W0710 07:05:42.676100 9995 process.cpp:2877] Attempted to spawn already running process files@172.16.10.69:37082 I0710 07:05:42.676537 9995 containerizer.cpp:314] Using isolation { environment_secret, posix/cpu, posix/mem, filesyst em/posix, network/cni } I0710 07:05:42.678514 9995 linux_launcher.cpp:144] Using /cgroup/freezer as the freezer hierarchy for the Linux launch er I0710 07:05:42.678980 9995 provisioner.cpp:298] Using default backend 'copy' I0710 07:05:42.680043 9995 cluster.cpp:510] Creating default 'local' authorizer I0710 07:05:42.680832 9998 slave.cpp:265] Mesos agent started on (522)@172.16.10.69:37082 I0710 07:05:42.680850 9998 slave.cpp:266] Flags at startup: --acls="" --appc_simple_discovery_uri_prefix="http://; --a
[jira] [Commented] (MESOS-9755) Upgrade bundled protobuf to 3.7.x.
[ https://issues.apache.org/jira/browse/MESOS-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878882#comment-16878882 ] Benjamin Mahler commented on MESOS-9755: For posterity, it looks like there is a performance regression in the v0 API when upgrading to protobuf 3.7.1: Master: {noformat} [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 Test setup: 1000 agents with a total of 1 running tasks and 1 completed tasks v0 '/state' response took 177.001464ms [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 (4593 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 Test setup: 1 agents with a total of 10 running tasks and 10 completed tasks v0 '/state' response took 1.802505171secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 (51571 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 Test setup: 2 agents with a total of 20 running tasks and 20 completed tasks v0 '/state' response took 3.164482263secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 (104737 ms) {noformat} After upgrading to 3.7.1: {noformat} [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 Test setup: 1000 agents with a total of 1 running tasks and 1 completed tasks v0 '/state' response took 253.753947ms [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/0 (6107 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 Test setup: 1 agents with a total of 10 running tasks and 10 completed tasks v0 '/state' response took 2.118297secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/1 (58902 ms) [ RUN ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 Test setup: 2 agents with a total of 20 running 
tasks and 20 completed tasks v0 '/state' response took 4.150050151secs [ OK ] AgentFrameworkTaskCountContentType/MasterStateQuery_BENCHMARK_Test.GetState/2 (116661 ms) {noformat} It appears to be due to a performance regression in the reflection code in protobuf. We may want to investigate further with the protobuf maintainers and/or investigate using the built in json conversion support rather than our reflection based implementation. > Upgrade bundled protobuf to 3.7.x. > -- > > Key: MESOS-9755 > URL: https://issues.apache.org/jira/browse/MESOS-9755 > Project: Mesos > Issue Type: Wish >Reporter: Kaiwalya Joshi >Priority: Major > Labels: foundations, integration, protobuf > > We're noticing the following warning emitted by the JVM on JDK9+ for Google > Protobuf _v3.5.0_ > {code} > WARNING: An illegal reflective access operation has occurred > WARNING: Illegal reflective access by com.google.protobuf.UnsafeUtil > (file:/home/kjoshi/.gradle/caches/modules-2/files-2.1/com.google.protobuf/protobuf-java/3.5.0/200fb936907fbab5e521d148026f6033d4aa539e/protobuf-java-3.5.0.jar) > to field java.nio.Buffer.address > WARNING: Please consider reporting this to the maintainers of > com.google.protobuf.UnsafeUtil > {code} > This warning is fixed in ProtoBuf versions [_v3.7.0_ and > above|https://github.com/protocolbuffers/protobuf/releases/tag/v3.7.0]. > As the current access warning can turn into an access violation in later > versions of the JDK, we're requesting Mesos to update to a version of > ProtoBuf that incorporates the needed fixes. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9755) Upgrade bundled protobuf to 3.7.x.
[ https://issues.apache.org/jira/browse/MESOS-9755?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16878791#comment-16878791 ] Benjamin Mahler commented on MESOS-9755: Note that upgrading protobuf to 3.7.x breaks the grpc build in the mesos autotools build: {noformat} [HOSTCXX] Compiling src/compiler/cpp_plugin.cc [HOSTCXX] Compiling src/compiler/node_plugin.cc [HOSTCXX] Compiling src/compiler/csharp_plugin.cc [HOSTCXX] Compiling src/compiler/php_plugin.cc [HOSTCXX] Compiling src/compiler/objective_c_plugin.cc [HOSTCXX] Compiling src/compiler/python_plugin.cc [HOSTCXX] Compiling src/compiler/ruby_plugin.cc [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_python_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_csharp_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_objective_c_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_ruby_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_node_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_php_plugin [HOSTLD] Linking /home/bmahler/git/mesos3/build/3rdparty/grpc-1.10.0/bins/opt/grpc_cpp_plugin [PROTOC] Generating protobuf CC file from src/proto/grpc/health/v1/health.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/echo_messages.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/payloads.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/core/stats.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/messages.proto third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/srcthird_party/protobuf/src: warning: directory does not exist.: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. 
third_party/protobuf/src: warning: directory does not exist. [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/health/v1/health.proto [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/testing/payloads.proto [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/core/stats.proto third_party/protobuf/src: warning: directory does not exist. [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/testing/echo_messages.proto [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/echo.proto third_party/protobuf/src: warning: directory does not exist. [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/duplicate/echo_duplicate.proto third_party/protobuf/src: warning: directory does not exist. [PROTOC] Generating protobuf CC file from src/proto/grpc/testing/stats.proto [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): [GRPC]Generating gRPC's protobuf service CC file from src/proto/grpc/testing/messages.proto 
third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. third_party/protobuf/src: warning: directory does not exist. [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): third_party/protobuf/src: warning: directory does not exist. [libprotobuf FATAL google/protobuf/generated_message_util.cc:794] CHECK failed: (scc->visit_status.load(std::memory_order_relaxed)) == (SCCInfoBase::kRunning): terminate called after throwing an instance of 'google::protobuf::FatalException' what(): CHECK failed:
[jira] [Created] (MESOS-9881) StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky.
Benjamin Mahler created MESOS-9881: -- Summary: StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery is flaky. Key: MESOS-9881 URL: https://issues.apache.org/jira/browse/MESOS-9881 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler This failed in CI: {noformat} 1 tests failed. FAILED: CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0 Error Message: ../../../3rdparty/libprocess/include/process/gmock.hpp:667 Mock function called more times than expected - returning default value. Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 00-00 00-00 10-01 00-00 00-00 00-00>) Returns: false Expected: to be never called Actual: called once - over-saturated and active Stack Trace: ../../../3rdparty/libprocess/include/process/gmock.hpp:667 Mock function called more times than expected - returning default value. Function call: filter(@0x5617542ee270 master@172.17.0.3:35735, @0x7f83cc053c30 264-byte object <48-23 06-32 84-7F 00-00 40-DE 07-CC 83-7F 00-00 2B-00 00-00 00-00 00-00 2B-00 00-00 00-00 00-00 4C-65 6E-67 74-68 00-6F 20-AF 00-54 17-56 00-00 10-AF 00-54 17-56 00-00 02-00 00-00 AC-11 00-03 ... 
20-20 05-CC 83-7F 00-00 00-00 00-00 6E-20 76-61 50-2B 4B-53 17-56 00-00 40-2B 4B-53 17-56 00-00 60-DA 07-CC 83-7F 00-00 CA-03 00-00 00-00 00-00 CA-03 00-00 00-00 00-00 10-01 00-00 00-00 00-00>) Returns: false Expected: to be never called Actual: called once - over-saturated and active {noformat} Full test output: {noformat} [ RUN ] CSIVersion/StorageLocalResourceProviderTest.RetryOperationStatusUpdateAfterRecovery/v0 I0702 06:51:02.172196 6961 cluster.cpp:176] Creating default 'local' authorizer I0702 06:51:02.183229 17274 master.cpp:440] Master c310f701-ca24-4ea8-a4be-df3aa3637194 (005dc56bde82) started on 172.17.0.3:35735 I0702 06:51:02.184095 17274 master.cpp:443] Flags at startup: --acls="" --agent_ping_timeout="15secs" --agent_reregister_timeout="10mins" --allocation_interval="50ms" --allocator="hierarchical" --authenticate_agents="true" --authenticate_frameworks="true" --authenticate_http_frameworks="true" --authenticate_http_readonly="true" --authenticate_http_readwrite="true" --authentication_v0_timeout="15secs" --authenticators="crammd5" --authorizers="local" --credentials="/tmp/Pq6bYz/credentials" --filter_gpu_resources="true" --framework_sorter="drf" --help="false" --hostname_lookup="true" --http_authenticators="basic" --http_framework_authenticators="basic" --initialize_driver_logging="true" --log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" --max_agent_ping_timeouts="5" --max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" --max_operator_event_stream_subscribers="1000" --max_unreachable_tasks_per_framework="1000" --memory_profiling="false" --min_allocatable_resources="cpus:0.01|mem:32" --port="5050" --publish_per_framework_metrics="true" --quiet="false" --recovery_agent_removal_limit="100%" --registry="in_memory" --registry_fetch_timeout="1mins" --registry_gc_interval="15mins" --registry_max_agent_age="2weeks" --registry_max_agent_count="102400" --registry_store_timeout="100secs" --registry_strict="false" 
--require_agent_domain="false" --role_sorter="drf" --root_submissions="true" --version="false" --webui_dir="/tmp/SRC/build/mesos-1.9.0/_inst/share/mesos/webui" --work_dir="/tmp/Pq6bYz/master" --zk_session_timeout="10secs" I0702 06:51:02.185236 17274 master.cpp:492] Master only allowing authenticated frameworks to register I0702 06:51:02.185819 17274 master.cpp:498] Master only allowing authenticated agents to register I0702 06:51:02.186395 17274 master.cpp:504] Master only allowing authenticated HTTP frameworks to register I0702 06:51:02.186951 17274 credentials.hpp:37] Loading credentials for authentication from '/tmp/Pq6bYz/credentials' I0702 06:51:02.187907 17274 master.cpp:548] Using default 'crammd5' authenticator I0702 06:51:02.188771 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readonly' I0702 06:51:02.189630 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-readwrite' I0702 06:51:02.190573 17274 http.cpp:975] Creating default 'basic' HTTP authenticator for realm 'mesos-master-scheduler' I0702 06:51:02.191690 17274 master.cpp:629] Authorization enabled I0702 06:51:02.195374 17265
[jira] [Created] (MESOS-9880) Update SUPPRESS/REVIVE calls to return error codes / 200 OK.
Benjamin Mahler created MESOS-9880: -- Summary: Update SUPPRESS/REVIVE calls to return error codes / 200 OK. Key: MESOS-9880 URL: https://issues.apache.org/jira/browse/MESOS-9880 Project: Mesos Issue Type: Improvement Components: master, scheduler api Reporter: Benjamin Mahler Currently, the SUPPRESS/REVIVE calls always return '202 Accepted' even if the call is invalid. Instead, to be aligned with UPDATE_FRAMEWORK, these calls should: - Return 200 OK if successful. - Return an appropriate error response if invalid or erroneous. For the v0 driver, this means: - Send back a FrameworkErrorMessage if invalid or erroneous. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9871) Expose quota consumption in /roles endpoint.
Benjamin Mahler created MESOS-9871: -- Summary: Expose quota consumption in /roles endpoint. Key: MESOS-9871 URL: https://issues.apache.org/jira/browse/MESOS-9871 Project: Mesos Issue Type: Task Components: master Reporter: Benjamin Mahler Assignee: Benjamin Mahler As part of exposing quota consumption to users and displaying quota consumption in the ui, we will need to add it to the /roles endpoint (which is currently what the ui uses for the roles table). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9870) Simultaneous adding/removal of a role from framework's roles and its suppressed roles crashes the master.
[ https://issues.apache.org/jira/browse/MESOS-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9870: -- Assignee: Andrei Sekretenko Target Version/s: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 (was: 1.9.0) > Simultaneous adding/removal of a role from framework's roles and its > suppressed roles crashes the master. > - > > Key: MESOS-9870 > URL: https://issues.apache.org/jira/browse/MESOS-9870 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Blocker > Labels: resource-management > > Calling UPDATE_FRAMEWORK with a new role added both to `FrameworkInfo.roles` > and `suppressed_roles` crashes the master. > The first place that doesn't expect this is the code that increments a `suppressed` > allocator metric: > [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/hierarchical.cpp#L507] > [ > https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/metrics.cpp#L255] > Probably there are other similar places. > Adding a new role in a suppressed state via re-subscribing should also > trigger this bug, but that hasn't been verified yet. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9870) Adding a new role in a suppressed state crashes the master.
[ https://issues.apache.org/jira/browse/MESOS-9870?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16874294#comment-16874294 ] Benjamin Mahler commented on MESOS-9870: Marked this as a blocker for the 1.9.0 release. > Adding a new role in a suppressed state crashes the master. > --- > > Key: MESOS-9870 > URL: https://issues.apache.org/jira/browse/MESOS-9870 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Priority: Major > > Calling UPDATE_FRAMEWORK with a new role added both to 'FrameworkInfo.roles` > and `suppressed_roles` crashes the master. > The first place which doesn't expect this is increasing a `suppressed` > allocator metric: > [https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/hierarchical.cpp#L507] > [ > https://github.com/apache/mesos/blob/fe7be9701e92d863734621ae1a3d339bb8598044/src/master/allocator/mesos/metrics.cpp#L255] > Probably there are other similar places. > Adding a new role in a suppressed state via re-subscribing should also > trigger this bug - haven't checked it -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7899) Expose sandboxes using virtual paths and hide the agent work directory.
[ https://issues.apache.org/jira/browse/MESOS-7899?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16872880#comment-16872880 ] Benjamin Mahler commented on MESOS-7899: Hi [~tomq42]! I'd like to direct you instead to the user@ mailing list or slack (e.g. #containerizer) to get help with this. > Expose sandboxes using virtual paths and hide the agent work directory. > --- > > Key: MESOS-7899 > URL: https://issues.apache.org/jira/browse/MESOS-7899 > Project: Mesos > Issue Type: Task >Reporter: Zhitao Li >Assignee: Zhitao Li >Priority: Major > Fix For: 1.5.0 > > > {{Files}} interface already supports a virtual file system. We should figure > out a way to enable this in {{ /files/download}} endpoint to hide agent > sandbox. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9124) Agent reconfiguration can cause master to unsuppress on scheduler's behalf
[ https://issues.apache.org/jira/browse/MESOS-9124?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16871826#comment-16871826 ] Benjamin Mahler commented on MESOS-9124: Backporting this fix to active release branches. > Agent reconfiguration can cause master to unsuppress on scheduler's behalf > -- > > Key: MESOS-9124 > URL: https://issues.apache.org/jira/browse/MESOS-9124 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Affects Versions: 1.5.3, 1.6.2, 1.7.2 >Reporter: Greg Mann >Assignee: Greg Mann >Priority: Major > Labels: foundations, mesosphere > Fix For: 1.8.0 > > > When agent reconfiguration was enabled in Mesos, the allocator was also > updated to remove all offer filters associated with an agent when that > agent's attributes change. In addition, whenever filters for an agent are > removed, the framework is unsuppressed for any roles that had filters on the > agent. > While this ensures that schedulers will have an opportunity to use resources > on an agent after reconfiguration, modifying the scheduler's suppression may > put the scheduler in an inconsistent state, where it believes it is > suppressed in a particular role when it is not. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9856) REVIVE call with specified role(s) clears filters for all roles of a framework.
[ https://issues.apache.org/jira/browse/MESOS-9856?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9856: -- Assignee: Andrei Sekretenko > REVIVE call with specified role(s) clears filters for all roles of a > framework. > --- > > Key: MESOS-9856 > URL: https://issues.apache.org/jira/browse/MESOS-9856 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > Labels: resource-management > > As pointed out by [~asekretenko], the REVIVE implementation in the allocator > incorrectly clears decline filters for all of the framework's roles, rather > than only those that were specified in the REVIVE call: > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1392 > This should only clear filters for the roles specified in the REVIVE call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9852) Slow memory growth due to deferred deletion of offer filters and timers.
[ https://issues.apache.org/jira/browse/MESOS-9852?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9852: -- Assignee: Benjamin Mahler > Slow memory growth due to deferred deletion of offer filters and timers. > > > Key: MESOS-9852 > URL: https://issues.apache.org/jira/browse/MESOS-9852 > Project: Mesos > Issue Type: Bug > Components: allocation, master >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > > The allocator does not keep a handle to the offer filter timer, which means > it cannot remove the timer overhead (in this case memory) when removing the > offer filter earlier (e.g. due to revive): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 > In addition, the offer filter is allocated on the heap but not deleted until > the timer fires (which might take forever!): > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 > https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 > We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9856) REVIVE call with specified role(s) clears filters for all roles of a framework.
Benjamin Mahler created MESOS-9856: -- Summary: REVIVE call with specified role(s) clears filters for all roles of a framework. Key: MESOS-9856 URL: https://issues.apache.org/jira/browse/MESOS-9856 Project: Mesos Issue Type: Bug Components: allocation Reporter: Benjamin Mahler As pointed out by [~asekretenko], the REVIVE implementation in the allocator incorrectly clears decline filters for all of the framework's roles, rather than only those that were specified in the REVIVE call: https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1392 This should only clear filters for the roles specified in the REVIVE call. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8789) Role-related endpoints should display distinct offered and allocated resources.
[ https://issues.apache.org/jira/browse/MESOS-8789?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8789: -- Assignee: Benjamin Mahler (was: Till Toenshoff) > Role-related endpoints should display distinct offered and allocated > resources. > --- > > Key: MESOS-8789 > URL: https://issues.apache.org/jira/browse/MESOS-8789 > Project: Mesos > Issue Type: Improvement > Components: agent, HTTP API, master >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, multitenancy, resource-management > > The role endpoints currently show accumulated values for resources > (allocated), containing offered resources. For gaining an overview showing > our allocated resources separately from the offered resources could improve > the signal quality, depending on the use case. > This also affects the UI display, for example the "Roles" tab. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8790) Deprecate Role::resources in favor of Role::allocated and Role::offered.
[ https://issues.apache.org/jira/browse/MESOS-8790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8790: -- Assignee: Benjamin Mahler (was: Till Toenshoff) > Deprecate Role::resources in favor of Role::allocated and Role::offered. > > > Key: MESOS-8790 > URL: https://issues.apache.org/jira/browse/MESOS-8790 > Project: Mesos > Issue Type: Improvement > Components: HTTP API, master >Affects Versions: 1.5.0 >Reporter: Till Toenshoff >Assignee: Benjamin Mahler >Priority: Minor > Labels: mesosphere, multitenancy, resource-management > > There are upcoming enhancements around role related resource accounting. The > changes will add a more detailed role related resources accounting. > We need to retire the {{resources}} member of the {{Role}} Message in > mesos.proto (V0 + V1). This in turn means that we follow this deprecation on > the role-related endpoints as well, adding {{allocated}} to both "/roles" as > well as "GET_ROLES". -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9852) Slow memory growth due to deferred deletion of offer filters and timers.
Benjamin Mahler created MESOS-9852: -- Summary: Slow memory growth due to deferred deletion of offer filters and timers. Key: MESOS-9852 URL: https://issues.apache.org/jira/browse/MESOS-9852 Project: Mesos Issue Type: Bug Components: allocation, master Reporter: Benjamin Mahler The allocator does not keep a handle to the offer filter timer, which means it cannot remove the timer overhead (in this case memory) when removing the offer filter earlier (e.g. due to revive): https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1338-L1352 In addition, the offer filter is allocated on the heap but not deleted until the timer fires (which might take forever!): https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1321 https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L1408-L1413 https://github.com/apache/mesos/blob/1.8.0/src/master/allocator/mesos/hierarchical.cpp#L2249 We'll need to try to backport this to all active release branches. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9813) Track role consumed quota for all roles in the allocator.
[ https://issues.apache.org/jira/browse/MESOS-9813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16866034#comment-16866034 ] Benjamin Mahler commented on MESOS-9813: Note that the consumption metrics we expose should not include what is offered, which means we can't simply use the allocator's tracking of quota consumption, since it's unable to distinguish between offered and allocated. > Track role consumed quota for all roles in the allocator. > - > > Key: MESOS-9813 > URL: https://issues.apache.org/jira/browse/MESOS-9813 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Meng Zhu >Priority: Major > Labels: resource-management > > We are already tracking role consumed quota for roles with non-default quota > in the allocator. We should expand that to track all roles' consumptions > which will then be exposed through metrics later. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9849) Add support for per-role REVIVE / SUPPRESS to V0 scheduler driver.
Benjamin Mahler created MESOS-9849: -- Summary: Add support for per-role REVIVE / SUPPRESS to V0 scheduler driver. Key: MESOS-9849 URL: https://issues.apache.org/jira/browse/MESOS-9849 Project: Mesos Issue Type: Task Components: scheduler driver Reporter: Benjamin Mahler Unfortunately, there are still schedulers that are using the v0 bindings and are unable to move to v1 before wanting to use the per-role REVIVE / SUPPRESS calls. We'll need to add per-role REVIVE / SUPPRESS into the v0 scheduler driver. -- This message was sent by Atlassian JIRA (v7.6.3#76005)

[jira] [Commented] (MESOS-9793) Implement UPDATE_FRAMEWORK call in V0 API
[ https://issues.apache.org/jira/browse/MESOS-9793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16864803#comment-16864803 ] Benjamin Mahler commented on MESOS-9793: [~asekretenko] friendly reminder to add the ticket to the reviews. I assume this ticket is also tracking the python bindings? > Implement UPDATE_FRAMEWORK call in V0 API > - > > Key: MESOS-9793 > URL: https://issues.apache.org/jira/browse/MESOS-9793 > Project: Mesos > Issue Type: Task >Reporter: Andrei Sekretenko >Assignee: Andrei Sekretenko >Priority: Major > Labels: multitenancy, resource-management > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9808) libprocess can deadlock on termination (cleanup() vs use() + terminate())
[ https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9808: -- Assignee: Benjamin Mahler > libprocess can deadlock on termination (cleanup() vs use() + terminate()) > - > > Key: MESOS-9808 > URL: https://issues.apache.org/jira/browse/MESOS-9808 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Assignee: Benjamin Mahler >Priority: Major > Labels: foundations > Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt, > deadlock_stacks_with_fix.txt > > > Using the process::loop() together with the common pattern of using > libprocess (Process wrapper + dispatching) is prone to causing a deadlock on > libprocess termination if the code does not wait for the loop exit before > termination. > *The deadlock itself is not directly caused by the process::loop(), though.* > It occurs in a following setup with two processes (let's name them A and B). > Thread 1 tries to cleanup process A. It locks processes_mutex and hangs here: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079] > waiting for the process A to have no strong references. > Thread 2 begins with creating a ProcessReference in > ProcessManager::deliver(UPID&) called for process: > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799] > and ends up waiting for processes_mutex in ProcessManager::terminate() for > process B: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155] > - > In the observed case, terminate() for process B was triggered by a > destructor of a process-wrapping object owned by a libprocess loop executing > on A. > I'm attaching the stacks captured at the deadlock. 
Stacks of the threads > which lock one another are in [^deadlock_stacks_filtered.txt] Note frame #1 > in Thread 5 (waiting for all references to expire) and frames #48 and #8 in > Thread 19 (creating a reference and waiting for a processes_mutex). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9808) libprocess can deadlock on termination (cleanup() vs use() + terminate())
[ https://issues.apache.org/jira/browse/MESOS-9808?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16855287#comment-16855287 ] Benjamin Mahler commented on MESOS-9808: Thanks for looking into this [~asekretenko]! This can happen when a dispatch has objects that are bound into it whose destructors will do any of the following: * terminate a process * dispatch to a process using a UPID that didn't resolve to a Process upon construction (highly doubt we have any code doing this) * send a message to a local Process (i.e. in the same OS process) (doubt this will be an issue outside of testing since we use dispatch for local components) The issue is that we currently destruct dropped DispatchEvents to TERMINATING Processes while holding the TERMINATING ProcessReference (whoops!), and so we can execute further calls that try to block on the processes_mutex (e.g. terminate()) while the cleanup of the TERMINATING Process is spinning waiting for transient references to go away. I'm not sure how common the terminate case above is, but it's the most worrying. Probably it makes sense to backport the fix to at least 1.8.x, and ideally further back. I wrote a fix, and spent some time trying to test this but gave up after being unable to figure out how to reliably get into a deadlock state without races. The fix is here: https://reviews.apache.org/r/70778/ Can you let me know if it fixes the issue that you saw without your workaround? 
> libprocess can deadlock on termination (cleanup() vs use() + terminate()) > - > > Key: MESOS-9808 > URL: https://issues.apache.org/jira/browse/MESOS-9808 > Project: Mesos > Issue Type: Bug >Reporter: Andrei Sekretenko >Priority: Major > Labels: foundations > Attachments: deadlock_stacks.txt, deadlock_stacks_filtered.txt > > > Using the process::loop() together with the common pattern of using > libprocess (Process wrapper + dispatching) is prone to causing a deadlock on > libprocess termination if the code does not wait for the loop exit before > termination. > *The deadlock itself is not directly caused by the process::loop(), though.* > It occurs in a following setup with two processes (let's name them A and B). > Thread 1 tries to cleanup process A. It locks processes_mutex and hangs here: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3079] > waiting for the process A to have no strong references. > Thread 2 begins with creating a ProcessReference in > ProcessManager::deliver(UPID&) called for process: > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L2799] > and ends up waiting for processes_mutex in ProcessManager::terminate() for > process B: > > [https://github.com/apache/mesos/blob/663bfa68b6ab68f4c28ed6a01ac42ac2ad23ac07/3rdparty/libprocess/src/process.cpp#L3155] > - > In the observed case, terminate() for process B was triggered by a > destructor of a process-wrapping object owned by a libprocess loop executing > on A. > I'm attaching the stacks captured at the deadlock. Stacks of the threads > which lock one another are in [^deadlock_stacks_filtered.txt] Note frame #1 > in Thread 5 (waiting for all references to expire) and frames #48 and #8 in > Thread 19 (creating a reference and waiting for a processes_mutex). -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9801) Use protobuf arenas for v1 API responses.
[ https://issues.apache.org/jira/browse/MESOS-9801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9801: -- Assignee: William Mahler > Use protobuf arenas for v1 API responses. > - > > Key: MESOS-9801 > URL: https://issues.apache.org/jira/browse/MESOS-9801 > Project: Mesos > Issue Type: Improvement > Components: agent, master >Reporter: Benjamin Mahler >Assignee: William Mahler >Priority: Major > Labels: performance > > The v1 API response construction is currently slower than the v0 API response > construction. A primary reason for this is that the v1 API constructs > intermediate C++ protobuf response objects, which are very expensive in terms > of memory allocation/deallocation cost. Also involved is the use of > {{evolve()}} which evolves messages from unversioned protobuf into v1 > protobuf. This also has very high memory allocation / deallocation cost. > Using arenas for all v1 API response construction will provide a significant > improvement. > This ticket currently captures all the aspects of this: > * Updating {{evolve()}} to use arenas across all v1 API responses. > * Updating all response construction functions (e.g. {{getState())}}) to use > arenas. > * Making this change for both the master and agent. > This is blocked by MESOS-9755 since we need to upgrade our bundled protobuf > to have string fields allocated in the arenas. > We may split out tickets for CHANGELOG purposes if only a portion of this > lands in 1.9.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9801) Use protobuf arenas for v1 API responses.
Benjamin Mahler created MESOS-9801: -- Summary: Use protobuf arenas for v1 API responses. Key: MESOS-9801 URL: https://issues.apache.org/jira/browse/MESOS-9801 Project: Mesos Issue Type: Improvement Components: agent, master Reporter: Benjamin Mahler The v1 API response construction is currently slower than the v0 API response construction. A primary reason for this is that the v1 API constructs intermediate C++ protobuf response objects, which are very expensive in terms of memory allocation/deallocation cost. Also involved is the use of {{evolve()}} which evolves messages from unversioned protobuf into v1 protobuf. This also has very high memory allocation / deallocation cost. Using arenas for all v1 API response construction will provide a significant improvement. This ticket currently captures all the aspects of this: * Updating {{evolve()}} to use arenas across all v1 API responses. * Updating all response construction functions (e.g. {{getState())}}) to use arenas. * Making this change for both the master and agent. This is blocked by MESOS-9755 since we need to upgrade our bundled protobuf to have string fields allocated in the arenas. We may split out tickets for CHANGELOG purposes if only a portion of this lands in 1.9.0. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9787) Log slow SSL (TLS) peer reverse DNS lookup.
[ https://issues.apache.org/jira/browse/MESOS-9787?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9787: -- Assignee: Benjamin Mahler > Log slow SSL (TLS) peer reverse DNS lookup. > --- > > Key: MESOS-9787 > URL: https://issues.apache.org/jira/browse/MESOS-9787 > Project: Mesos > Issue Type: Improvement > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.1, 1.9.0 > > > Given the severity of MESOS-9339, we should add logging of slow SSL (TLS) peer reverse DNS lookups. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9787) Log slow SSL (TLS) peer reverse DNS lookup.
Benjamin Mahler created MESOS-9787: -- Summary: Log slow SSL (TLS) peer reverse DNS lookup. Key: MESOS-9787 URL: https://issues.apache.org/jira/browse/MESOS-9787 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Given the severity of MESOS-9339, we should add logging of slow SSL (TLS) peer reverse DNS lookups. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9749) mesos agent logging hangs upon systemd-journald restart
[ https://issues.apache.org/jira/browse/MESOS-9749?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16839372#comment-16839372 ] Benjamin Mahler commented on MESOS-9749: cc [~kaysoky] > mesos agent logging hangs upon systemd-journald restart > --- > > Key: MESOS-9749 > URL: https://issues.apache.org/jira/browse/MESOS-9749 > Project: Mesos > Issue Type: Bug >Affects Versions: 1.7.2 > Environment: Running on centos 7.4.1708, systemd 219 (probably > heavily patched by centos) > mesos-agent command: > {code} > /usr/sbin/mesos-slave \ > > --attributes='canary:canary-false;maintenance_group:group-6;network:10g;platform:centos;platform_major_version:7;rack_name:22.05;type:base;version:v2018-q-1' > \ > --cgroups_enable_cfs \ > --cgroups_hierarchy='/sys/fs/cgroup' \ > --cgroups_net_cls_primary_handle='0xC370' \ > --container_logger='org_apache_mesos_LogrotateContainerLogger' \ > --containerizers='mesos' \ > --credential='file:///etc/mesos-chef/slave-credential' \ > > --default_container_info='\{"type":"MESOS","volumes":[{"host_path":"tmp","container_path":"/tmp","mode":"RW"},\{"host_path":"var_tmp","container_path":"/var/tmp","mode":"RW"},\{"host_path":".","container_path":"/mnt/mesos/sandbox","mode":"RW"},\{"host_path":"/usr/share/mesos/geoip","container_path":"/mnt/mesos/geoip","mode":"RO"}]}' > \ > --docker_registry='https://filer-docker-registry.prod.crto.in/' \ > --docker_store_dir='/var/opt/mesos/store/docker' \ > --enforce_container_disk_quota \ > > --executor_environment_variables='\{"PATH":"/bin:/usr/bin","CRITEO_DC":"par","CRITEO_ENV":"prod","CRITEO_GEOIP_PATH":"/mnt/mesos/geoip"}' > \ > --executor_registration_timeout='5mins' \ > --fetcher_cache_dir='/var/opt/mesos/cache' \ > --fetcher_cache_size='2GB' \ > --hooks='com_criteo_mesos_CommandHook' \ > --image_providers='docker' \ > --image_provisioner_backend='copy' \ > > 
--isolation='linux/capabilities,cgroups/cpu,cgroups/mem,cgroups/net_cls,namespaces/pid,filesystem/linux,docker/runtime,network/cni,disk/xfs,com_criteo_mesos_CommandIsolator' > \ > --logging_level='INFO' \ > > --master='zk://mesos:xx...@mesos-master01-par.central.criteo.prod:2181,mesos-master02-par.central.criteo.prod:2181,mesos-master03-par.central.criteo.prod:2181/mesos' > \ > --modules='file:///etc/mesos-chef/slave-modules.json' \ > --port=5051 \ > --recover='reconnect' \ > --resources='file:///etc/mesos-chef/custom_resources.json' \ > --strict \ > --work_dir='/var/opt/mesos' \ > --xfs_kill_containers \ > --xfs_project_range='[5000-50]' > {code} >Reporter: Gregoire Seux >Priority: Minor > Labels: foundations > > When mesos agent is launched through systemd, a restart of systemd-journald > service makes mesos agent logging hang (no more output).. The process itself > seems to work fine (we can query state via http for instance). > A restart of mesos-agent corrects the issue. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9773) Log the peer address during SSL handshake failure.
Benjamin Mahler created MESOS-9773: -- Summary: Log the peer address during SSL handshake failure. Key: MESOS-9773 URL: https://issues.apache.org/jira/browse/MESOS-9773 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Recently, peer address logging was added to *most* socket errors per MESOS. However, in the case where a non-SSL connection arrives when we have SSL-only mandated, the following confusing error is printed: {noformat} "process.cpp Failed to accept socket: Failed accept: connection error: error::lib(0):func(0):reason(0)" {noformat} We should be able to avoid the confusing message here as well as include the peer address, so that it's easier to know where the connection is coming from. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834119#comment-16834119 ] Benjamin Mahler commented on MESOS-9767: The bizarre thread stack is: {noformat} Thread 21 (Thread 0x7fa1e0e4d700 (LWP 85889)): #0 0x7fa1f05f01c2 in hash_combine_impl (k=52, h=) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:264 #1 hash_combine (v=, seed=) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337 #2 hash_range<__gnu_cxx::__normal_iterator > > (last=..., first=52 '4') at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:351 #3 hash_value > (v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:410 #4 operator() (this=, v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:486 #5 boost::hash_combine (seed=seed@entry=@0x7fa1e0e4c770: 0, v=...) at ../3rdparty/boost-1.65.0/boost/functional/hash/hash.hpp:337 #6 0x7fa1f06ad178 in operator() (this=0x7fa1cc02d068, taskId=...) at /mesos/include/mesos/type_utils.hpp:634 #7 _M_hash_code (this=0x7fa1cc02d068, __k=...) at /usr/include/c++/4.9/bits/hashtable_policy.h:1261 #8 std::_Hashtable > > >, std::allocator > > > >, std::__detail::_Select1st, std::equal_to, std::hash, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits >::count (this=this@entry=0x7fa1cc02d068, __k=...) at /usr/include/c++/4.9/bits/hashtable.h:1336 #9 0x7fa1f0663eb2 in count (__x=..., this=0x7fa1cc02d068) at /usr/include/c++/4.9/bits/unordered_map.h:592 #10 contains (key=..., this=0x7fa1cc02d068) at /mesos/3rdparty/stout/include/stout/hashmap.hpp:88 #11 erase (key=..., this=0x7fa1cc02d050) at /mesos/3rdparty/stout/include/stout/boundedhashmap.hpp:92 #12 mesos::internal::master::Master::__reregisterSlave(process::UPID const&, mesos::internal::ReregisterSlaveMessage&&, process::Future const&) (this=0x561dcf047380, pid=..., reregisterSlaveMessage=, future=...) 
at /mesos/src/master/master.cpp:7369 #13 0x7fa1f14d54e1 in operator() (args#0=0x561dcf048620, this=) at /mesos/3rdparty/libprocess/../stout/include/stout/lambda.hpp:443 #14 process::ProcessBase::consume(process::DispatchEvent&&) (this=, event=) at /mesos/3rdparty/libprocess/src/process.cpp:3577 #15 0x7fa1f14e89b2 in serve ( event=, this=0x561dcf048620) at /mesos/3rdparty/libprocess/include/process/process.hpp:87 #16 process::ProcessManager::resume (this=, process=0x561dcf048620) at /mesos/3rdparty/libprocess/src/process.cpp:3002 #17 0x7fa1f14ee226 in operator() (__closure=0x561dcf119158) at /mesos/3rdparty/libprocess/src/process.cpp:2511 #18 _M_invoke<> (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1700 #19 operator() (this=0x561dcf119158) at /usr/include/c++/4.9/functional:1688 #20 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf119140) at /usr/include/c++/4.9/thread:115 #21 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #22 0x7fa1ee520064 in start_thread (arg=0x7fa1e0e4d700) at pthread_create.c:309 #23 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 {noformat} [~ggarg] is this trace present whenever it's hanging? > Add self health monitoring in Mesos master > -- > > Key: MESOS-9767 > URL: https://issues.apache.org/jira/browse/MESOS-9767 > Project: Mesos > Issue Type: Task > Components: master >Affects Versions: 1.6.0 >Reporter: Gaurav Garg >Priority: Major > Fix For: 1.7.2 > > > We have seen issue where Mesos master got stuck and was not responding to > HTTP endpoints like "/metrics/snapshot". This results in calls by the > frameworks and metrics collector to the master to hang. Currently we emit > 'master alive' metric using prometheus. If master hangs, this metrics is not > published and we detect the hangs using alerts on top of this metrics. By the > time someone would have got the alert and restarted the master process, > 15-30mins would have passed by. 
This results in SLA violation by Mesos > cluster users. > It will be nice to implement a self health check monitoring to detect if the > Mesos master is hung/stuck. This will help us to quickly crash the master > process so that one of the other member of the quorum can acquire ZK > leadership lock. > We can use the "/master/health" endpoint for health checks. > Health checks can be initiated in > [src/master/main.cpp|[https://github.com/apache/mesos/blob/master/src/master/main.cpp]] > just after the child master process is > [spawned.|[https://github.com/apache/mesos/blob/master/src/master/main.cpp#L543]] > We can leverage the >
[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834089#comment-16834089 ] Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:54 PM: Mesos master stopped responding to HTTP request at around 16:30PM. At around 17:00PM, master was restarted. Logs are attached after the stack trace. Logs of Mesos master around the same time: {noformat} I0429 16:26:45.664958 85889 master.cpp:8397] Sending status update TASK_FAILED for task 58f5b1e4-844d-4909-b75e-294ecc919a3f-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665169 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0accdb07-74f4-42d1-8921-1d0703d3c907-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665390 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665594 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7fbdf4f6-9947-413b-9b06-3e6c57d93cba-2-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.665812 85889 master.cpp:8397] Sending status update TASK_FAILED for task 588c43e4-38ee-4c29-947c-b59b9bd431f5-3-7 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666008 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e4bd9f6-8da9-4569-9b23-dbfb0eb27c3f-0-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666244 85889 master.cpp:8397] Sending status update TASK_FAILED for task 11c4d38d-a641-4936-ad16-b8c237e74498-1-34629 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666452 85889 master.cpp:8397] Sending status update 
TASK_FAILED for task a675e086-fadf-47a0-87a4-a3c0f305b2c4-1-4 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.79 85889 master.cpp:8397] Sending status update TASK_FAILED for task 0b1cb4d7-5fb1-499f-8c02-98df60739f58-1-34027 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.666882 85889 master.cpp:8397] Sending status update TASK_FAILED for task 1529c73c-3699-4cd8-81b8-07849f34e89c-3-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667078 85889 master.cpp:8397] Sending status update TASK_FAILED for task 3c9ad9c3-5cff-4550-b25e-d33b86d5a1ce-6-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667371 85889 master.cpp:8397] Sending status update TASK_FAILED for task 365aa302-a4b1-4a70-ab47-49acf55d36c4-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667604 85889 master.cpp:8397] Sending status update TASK_FAILED for task 47b370b1-2c1d-4679-93b4-93a33bb2783b-3-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.667842 85889 master.cpp:8397] Sending status update TASK_FAILED for task 161b2f67-7765-4d5f-94fe-fdcdb1b048e6-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668094 85889 master.cpp:8397] Sending status update TASK_FAILED for task 87323ff4-7018-45b1-990d-8d673f932f6e-1-33866 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668329 85889 master.cpp:8397] Sending status update TASK_FAILED for task 7e9fa49d-04f0-40f6-8799-9f0b47c3af83-2-3 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668557 85889 master.cpp:8397] Sending status update TASK_FAILED for task 
d043189a-ae4c-4061-80f6-efc1e43938e6-1-2 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.668810 85889 master.cpp:8397] Sending status update TASK_FAILED for task f6b5ec2b-0b80-4929-baf9-23e63e9be050-1-33287 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669023 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2df3cbb1-9790-492b-8250-5d1666557e53-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669239 85889 master.cpp:8397] Sending status update TASK_FAILED for task 2b0296a5-f576-47ba-ba46-88b7a604f1fb-1-1 of framework 3dcc744f-016c-6579-9b82-6325424502d2- 'Unreachable agent re-reregistered' I0429 16:26:45.669457 85889 master.cpp:8397]
[jira] [Comment Edited] (MESOS-9767) Add self health monitoring in Mesos master
[ https://issues.apache.org/jira/browse/MESOS-9767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16834088#comment-16834088 ] Benjamin Mahler edited comment on MESOS-9767 at 5/6/19 6:50 PM: Stack trace of the Mesos master when the hang was detected. Captured using gdb. {noformat} Thread 35 (Thread 0x7fa1e7e5b700 (LWP 85875)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf0ae768) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf0ae768) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf0ae750) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e7e5b700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 34 (Thread 0x7fa1e765a700 (LWP 85876)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11ff38) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf11ff38) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11ff20) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e765a700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 33 (Thread 0x7fa1e6e59700 (LWP 85877)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf11d988) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf11d988) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf11d970) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? 
() from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6e59700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 32 (Thread 0x7fa1e6658700 (LWP 85878)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:154 #3 wait (this=) at /mesos/3rdparty/libprocess/src/run_queue.hpp:73 #4 process::ProcessManager::dequeue (this=0x561dcf063970) at /mesos/3rdparty/libprocess/src/process.cpp:3305 #5 0x7fa1f14ee22f in operator() (__closure=0x561dcf128758) at /mesos/3rdparty/libprocess/src/process.cpp:2505 #6 _M_invoke<> (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1700 #7 operator() (this=0x561dcf128758) at /usr/include/c++/4.9/functional:1688 #8 std::thread::_Impl()> >::_M_run(void) (this=0x561dcf128740) at /usr/include/c++/4.9/thread:115 #9 0x7fa1ee7eb990 in ?? () from /usr/lib/x86_64-linux-gnu/libstdc++.so.6 #10 0x7fa1ee520064 in start_thread (arg=0x7fa1e6658700) at pthread_create.c:309 #11 0x7fa1ee25562d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111 Thread 31 (Thread 0x7fa1e5e57700 (LWP 85879)): #0 sem_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/sem_wait.S:85 #1 0x7fa1f14d6e82 in wait (this=) at /mesos/3rdparty/libprocess/src/semaphore.hpp:115 #2 wait (this=) at
[jira] [Assigned] (MESOS-9766) /__processes__ endpoint can hang.
[ https://issues.apache.org/jira/browse/MESOS-9766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9766: -- Assignee: Benjamin Mahler > /__processes__ endpoint can hang. > - > > Key: MESOS-9766 > URL: https://issues.apache.org/jira/browse/MESOS-9766 > Project: Mesos > Issue Type: Bug > Components: libprocess >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: foundations > > A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs. > Stack traces provided by [~alexr] revealed that all the threads appeared to > be idle waiting for events. After investigating the code, the issue was found > to be possible when a process gets terminated after the > {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the > dispatch and abandoning the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9766) /__processes__ endpoint can hang.
Benjamin Mahler created MESOS-9766: -- Summary: /__processes__ endpoint can hang. Key: MESOS-9766 URL: https://issues.apache.org/jira/browse/MESOS-9766 Project: Mesos Issue Type: Bug Components: libprocess Reporter: Benjamin Mahler A user reported that the {{/\_\_processes\_\_}} endpoint occasionally hangs. Stack traces provided by [~alexr] revealed that all the threads appeared to be idle waiting for events. After investigating the code, the issue was found to be possible when a process gets terminated after the {{/\_\_processes\_\_}} route handler dispatches to it, thus dropping the dispatch and abandoning the future. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
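The failure mode described in the ticket — a route handler dispatches to a process that terminates before replying, so the dispatch is dropped and the caller's future is abandoned — can be illustrated with standard C++ futures. Note one deliberate difference: a libprocess future simply stays pending forever when abandoned (hence the hang), whereas `std::future` at least surfaces `broken_promise`. The `Worker`/`dispatch` names below are hypothetical stand-ins, not libprocess APIs.

```cpp
#include <cassert>
#include <future>

// Hypothetical stand-in for an actor that answers a "list your state"
// dispatch by fulfilling a promise.
struct Worker
{
  std::promise<int> pending;  // response promise for an in-flight dispatch
};

// Dispatch a request and return the future; the caller then blocks on it.
std::future<int> dispatch(Worker& worker)
{
  return worker.pending.get_future();
}

// Simulate the race: the worker terminates (is destroyed) after the
// dispatch but before replying. With std::future the caller observes
// broken_promise; a libprocess future would instead stay pending
// forever -- which is the /__processes__ hang.
bool dispatchWasAbandoned()
{
  std::future<int> response;
  {
    Worker worker;
    response = dispatch(worker);
  }  // worker destroyed here without ever setting a value

  try {
    response.get();
    return false;
  } catch (const std::future_error& e) {
    return e.code() == std::make_error_code(std::future_errc::broken_promise);
  }
}
```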
[jira] [Commented] (MESOS-9761) Mesos UI does not properly account for resources set via `--default-role`
[ https://issues.apache.org/jira/browse/MESOS-9761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831700#comment-16831700 ] Benjamin Mahler commented on MESOS-9761: As [~vinodkone] mentioned, reservations will show up as "consumption" rather than "guarantee" or "limit". Linking in related ticket. > Mesos UI does not properly account for resources set via `--default-role` > - > > Key: MESOS-9761 > URL: https://issues.apache.org/jira/browse/MESOS-9761 > Project: Mesos > Issue Type: Bug >Reporter: Benno Evers >Priority: Major > Labels: resource-management, ui > Attachments: default_role_ui.png > > > In our cluster, we have two agents configured with > "--default_role=slave_public" and 64 cpus each, for a total of 128 cpus > allocated to this role. The right side of the screenshot shows one of them. > However, looking at the "Roles" tab in the Mesos UI, neither "Guarantee" nor > "Limit" shows any resources for this role. > See attached screenshot for details. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9619) Mesos Master Crashes with Launch Group when using Port Resources
[ https://issues.apache.org/jira/browse/MESOS-9619?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16831259#comment-16831259 ] Benjamin Mahler commented on MESOS-9619: Updated test: https://reviews.apache.org/r/70580/ > Mesos Master Crashes with Launch Group when using Port Resources > > > Key: MESOS-9619 > URL: https://issues.apache.org/jira/browse/MESOS-9619 > Project: Mesos > Issue Type: Bug > Components: allocation >Affects Versions: 1.4.3, 1.7.1 > Environment: > Testing in both Mesos 1.4.3 and Mesos 1.7.1 >Reporter: Nimi Wariboko Jr. >Assignee: Greg Mann >Priority: Critical > Labels: foundations, master, mesosphere > Fix For: 1.5.4, 1.6.3, 1.7.3, 1.8.0 > > Attachments: mesos-master.log, mesos-master.snippet.log > > > Original Issue: > [https://lists.apache.org/thread.html/979c8799d128ad0c436b53f2788568212f97ccf324933524f1b4d189@%3Cuser.mesos.apache.org%3E] > When the ports resource is removed, Mesos functions normally (I'm able to > launch the task as many times as possible, while it always fails continually). > Attached is a snippet of the mesos master log from OFFER to crash. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9689) Migrate stout hashmap and hashset to Abseil's "swiss tables".
[ https://issues.apache.org/jira/browse/MESOS-9689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16829344#comment-16829344 ] Benjamin Mahler commented on MESOS-9689: See also: https://code.fb.com/developer-tools/f14/ > Migrate stout hashmap and hashset to Abseil's "swiss tables". > - > > Key: MESOS-9689 > URL: https://issues.apache.org/jira/browse/MESOS-9689 > Project: Mesos > Issue Type: Improvement > Components: stout >Reporter: Benjamin Mahler >Priority: Major > Labels: performance > > For improved lookup and insertion performance, as well as lower memory > consumption, we should migrate stout's hashmap / hashset wrappers to use > Abseil's containers. > There are some subtleties to migration, see: > https://abseil.io/docs/cpp/guides/container > See also: https://youtu.be/ncHmEUmJZf4 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-8511) Provide a v0/v1 test scheduler to simplify the tests.
[ https://issues.apache.org/jira/browse/MESOS-8511?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-8511: -- Assignee: Benjamin Mahler > Provide a v0/v1 test scheduler to simplify the tests. > - > > Key: MESOS-8511 > URL: https://issues.apache.org/jira/browse/MESOS-8511 > Project: Mesos > Issue Type: Improvement > Components: test >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: tech-debt > > Currently, there are a lot of tests that just want to launch a task in order > to test some behavior of the system. These tests have to create their own v0 > or v1 scheduler and invoke the necessary calls on it and expect the necessary > calls / messages back. This is rather verbose. > It would be helpful to have some better abstractions here, like a > TestScheduler that can launch tasks and exposes the status updates for them, > along with other interesting information. E.g. > {code} > class TestScheduler > { > // Add the task to the queue of tasks that need to be launched. > // Returns the stream of status updates for this task. > Queue addTask(const TaskInfo& t); > etc > } > {code} > Probably this could be implemented against both v0 and v1, if we want to > parameterize the tests. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
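The ticket's {code} sketch can be fleshed out into a minimal compilable form. Everything below is illustrative — `TaskInfo`, `TaskStatus`, and `deliver` are simplified stand-ins for the real Mesos types and the mock master/agent side, not the actual test API:

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <queue>
#include <string>

// Hypothetical sketch of the TestScheduler abstraction proposed above:
// tests enqueue TaskInfos and read back a per-task stream of status
// updates, instead of wiring up a full v0/v1 scheduler by hand.
struct TaskInfo { std::string id; };
struct TaskStatus { std::string id; std::string state; };

class TestScheduler
{
public:
  // Add the task to the queue of tasks that need to be launched.
  // Returns the stream of status updates for this task.
  std::shared_ptr<std::queue<TaskStatus>> addTask(const TaskInfo& task)
  {
    pending.push(task);
    auto stream = std::make_shared<std::queue<TaskStatus>>();
    updates[task.id] = stream;
    return stream;
  }

  // Called by the (mock) master/agent side when an update arrives.
  void deliver(const TaskStatus& status)
  {
    updates[status.id]->push(status);
  }

private:
  std::queue<TaskInfo> pending;
  std::map<std::string, std::shared_ptr<std::queue<TaskStatus>>> updates;
};
```

The same interface could plausibly be backed by either the v0 driver or the v1 HTTP scheduler library, which is what would enable parameterized tests.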
[jira] [Assigned] (MESOS-9701) Allocator's roles map should track reservations.
[ https://issues.apache.org/jira/browse/MESOS-9701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9701: -- Assignee: Andrei Sekretenko > Allocator's roles map should track reservations. > > > Key: MESOS-9701 > URL: https://issues.apache.org/jira/browse/MESOS-9701 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > Labels: resource-management > > Currently, the allocator's {{roles}} map only tracks roles that have > allocations or framework subscriptions: > https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L531-L535 > And we separately track a map of total reservations for each role: > https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L541-L547 > Confusingly, the {{roles}} map won't have an entry when there is a > reservation for a role but no allocations or frameworks subscribed. We should > ensure that the map has an entry when there are reservations. Also, we can > consolidate the reservation information and framework ids into the same map, > e.g.: > {code} > struct Role > { > hashset frameworkIds; > ResourceQuantities totalReservations; > }; > hashmap roles; > {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9734) Allocator pause/resume functionality should compensate for a missed allocation cycle.
Benjamin Mahler created MESOS-9734: -- Summary: Allocator pause/resume functionality should compensate for a missed allocation cycle. Key: MESOS-9734 URL: https://issues.apache.org/jira/browse/MESOS-9734 Project: Mesos Issue Type: Bug Components: allocation Reporter: Benjamin Mahler This matters more when the allocation cycle interval is set to large values (e.g. 30 seconds, 1 minute, etc). When the allocator is paused, the interval timeouts continue but an allocation cycle gets skipped. So, if the interval is long, when it's resumed, it can take up to an entire interval again to have another cycle. E.g. with a 1 minute cycle:
0 mins
1 min: allocate
1.01 mins: pause
2 mins: allocate skipped
2.01 mins: resume
3 mins: allocate
In this case, one would expect that resuming at 2.01 mins should just immediately trigger an allocation cycle since we're "overdue" for one, and start the interval timeouts again fresh. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
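The compensation the ticket asks for can be sketched as a small state machine: remember whether an interval tick fired while paused, and if so run an allocation immediately on resume instead of waiting up to a full interval again. This is an illustrative sketch, not the real `HierarchicalAllocator` interface:

```cpp
#include <cassert>

// Sketch (hypothetical interface) of pause/resume compensation: a tick
// that fires while paused is remembered, and resume() triggers the
// overdue allocation immediately.
class Allocator
{
public:
  void pause() { paused = true; }

  // Called by the interval timer every `--allocation_interval`.
  void onIntervalTick()
  {
    if (paused) {
      missedCycle = true;  // allocation skipped; compensate on resume
    } else {
      allocate();
    }
  }

  void resume()
  {
    paused = false;
    if (missedCycle) {
      missedCycle = false;
      allocate();  // overdue: run now and restart the interval fresh
    }
  }

  int allocations = 0;  // exposed for the example's assertions

private:
  void allocate() { ++allocations; }

  bool paused = false;
  bool missedCycle = false;
};
```

In the 1-minute timeline above, this makes the 2.01-minute resume perform the allocation that was skipped at 2 minutes.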
[jira] [Commented] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
[ https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16819434#comment-16819434 ] Benjamin Mahler commented on MESOS-9710: {noformat} commit a03db7d684f343656aa229771f30c4990a2839c1 Author: Benjamin Mahler Date: Tue Apr 9 17:08:02 2019 -0400 Added a test of hierarchical sorting for the random sorter. Review: https://reviews.apache.org/r/70438 {noformat} > Add tests to ensure random sorter performs correct weighted sorting. > > > Key: MESOS-9710 > URL: https://issues.apache.org/jira/browse/MESOS-9710 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > > We added tests for the weighted shuffle algorithm, but didn't test that the > RandomSorter's sort() function behaves correctly. > We should also test that hierarchical weights in the random sorter behave > correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7258: -- Assignee: Andrei Sekretenko (was: Benjamin Mahler) > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Assignee: Andrei Sekretenko >Priority: Major > Labels: multitenancy, resource-management > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7258: -- Assignee: Benjamin Mahler > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > Labels: multitenancy, resource-management > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.
[ https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7258: -- Assignee: (was: Kapil Arya) > Provide scheduler calls to subscribe to additional roles and unsubscribe from > roles. > > > Key: MESOS-7258 > URL: https://issues.apache.org/jira/browse/MESOS-7258 > Project: Mesos > Issue Type: Improvement > Components: master, scheduler api >Reporter: Benjamin Mahler >Priority: Major > Labels: multitenancy, resource-management > > The current support for schedulers to subscribe to additional roles or > unsubscribe from some of their roles requires that the scheduler obtain a new > subscription with the master which invalidates the event stream. > A more lightweight mechanism would be to provide calls for the scheduler to > subscribe to additional roles or unsubscribe from some roles such that the > existing event stream remains open and offers to the new roles arrive on the > existing event stream. E.g. > SUBSCRIBE_TO_ROLE > UNSUBSCRIBE_FROM_ROLE > One open question pertains to the terminology here, whether we would want to > avoid using "subscribe" in this context. An alternative would be: > UPDATE_FRAMEWORK_INFO > Which provides a generic mechanism for a framework to perform framework info > updates without obtaining a new event stream. > In addition, it would be easier to use if it returned 200 on success and an > error response if invalid, etc. Rather than returning 202. > *NOTE*: Not specific to this issue, but we need to figure out how to allow > the framework to not leak reservations, e.g. MESOS-7651. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
[ https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9710: -- Assignee: Benjamin Mahler (was: Meng Zhu) Assigning to myself for adding the hierarchical tests. > Add tests to ensure random sorter performs correct weighted sorting. > > > Key: MESOS-9710 > URL: https://issues.apache.org/jira/browse/MESOS-9710 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Benjamin Mahler >Priority: Major > > We added tests for the weighted shuffle algorithm, but didn't test that the > RandomSorter's sort() function behaves correctly. > We should also test that hierarchical weights in the random sorter behave > correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
[ https://issues.apache.org/jira/browse/MESOS-9710?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16812758#comment-16812758 ] Benjamin Mahler commented on MESOS-9710: Review for the first half; testing that flat role sorting behaves correctly: https://reviews.apache.org/r/70418/ > Add tests to ensure random sorter performs correct weighted sorting. > > > Key: MESOS-9710 > URL: https://issues.apache.org/jira/browse/MESOS-9710 > Project: Mesos > Issue Type: Task > Components: allocation >Reporter: Benjamin Mahler >Assignee: Meng Zhu >Priority: Major > > We added tests for the weighted shuffle algorithm, but didn't test that the > RandomSorter's sort() function behaves correctly. > We should also test that hierarchical weights in the random sorter behave > correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9710) Add tests to ensure random sorter performs correct weighted sorting.
Benjamin Mahler created MESOS-9710: -- Summary: Add tests to ensure random sorter performs correct weighted sorting. Key: MESOS-9710 URL: https://issues.apache.org/jira/browse/MESOS-9710 Project: Mesos Issue Type: Task Components: allocation Reporter: Benjamin Mahler Assignee: Meng Zhu We added tests for the weighted shuffle algorithm, but didn't test that the RandomSorter's sort() function behaves correctly. We should also test that hierarchical weights in the random sorter behave correctly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
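A test of weighted random sorting is necessarily statistical: run the weighted pick many times and check that each role lands in the first position in proportion to its weight. The sketch below uses `std::discrete_distribution` as a stand-in for the RandomSorter's weighted shuffle; it is not the Mesos test code.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Measure how often `target` is picked first under the given weights.
// With weights {2, 1}, index 0 should win roughly 2/3 of the time.
double firstPickFrequency(
    const std::vector<double>& weights, int target, int trials)
{
  std::mt19937 rng(42);  // fixed seed keeps the check deterministic
  std::discrete_distribution<int> pick(weights.begin(), weights.end());

  int hits = 0;
  for (int i = 0; i < trials; ++i) {
    if (pick(rng) == target) {
      ++hits;
    }
  }
  return static_cast<double>(hits) / trials;
}
```

A real test would assert the frequency lies within a tolerance band wide enough to avoid flakiness, which is essentially what the review linked in the comments above does for flat role sorting.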
[jira] [Created] (MESOS-9701) Allocator's roles map should track reservations.
Benjamin Mahler created MESOS-9701: -- Summary: Allocator's roles map should track reservations. Key: MESOS-9701 URL: https://issues.apache.org/jira/browse/MESOS-9701 Project: Mesos Issue Type: Improvement Components: allocation Reporter: Benjamin Mahler Currently, the allocator's {{roles}} map only tracks roles that have allocations or framework subscriptions: https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L531-L535 And we separately track a map of total reservations for each role: https://github.com/apache/mesos/blob/1.7.2/src/master/allocator/mesos/hierarchical.hpp#L541-L547 Confusingly, the {{roles}} map won't have an entry when there is a reservation for a role but no allocations or frameworks subscribed. We should ensure that the map has an entry when there are reservations. Also, we can consolidate the reservation information and framework ids into the same map, e.g.: {code} struct Role { hashset frameworkIds; ResourceQuantities totalReservations; }; hashmap roles; {code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
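The consolidation the ticket proposes can be made concrete with a compilable sketch. The names here are illustrative (a scalar `double` stands in for `ResourceQuantities`, `std::map` for the stout containers); the point is that either a framework subscription or a reservation creates the role's entry, so a role with only reservations is no longer invisible:

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>

// Sketch of the consolidated Role map from the ticket: both framework
// ids and total reservations live in one entry per role.
struct Role
{
  std::set<std::string> frameworkIds;
  double totalReservedCpus = 0;  // stand-in for ResourceQuantities
};

class Roles
{
public:
  void subscribe(const std::string& role, const std::string& frameworkId)
  {
    roles[role].frameworkIds.insert(frameworkId);
  }

  void trackReservation(const std::string& role, double cpus)
  {
    roles[role].totalReservedCpus += cpus;  // creates the entry if absent
  }

  bool contains(const std::string& role) const
  {
    return roles.count(role) > 0;
  }

  std::map<std::string, Role> roles;
};
```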
[jira] [Commented] (MESOS-9688) Quota is not enforced properly when subroles have reservations.
[ https://issues.apache.org/jira/browse/MESOS-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16810208#comment-16810208 ] Benjamin Mahler commented on MESOS-9688: Additional fix: https://reviews.apache.org/r/70393/ > Quota is not enforced properly when subroles have reservations. > --- > > Key: MESOS-9688 > URL: https://issues.apache.org/jira/browse/MESOS-9688 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.8.0 > > > Note: the discussion here concerns quota enforcement for top-level role, > setting quota on sublevel role is not supported. > If a subrole directly makes a reservation, the accounting of > `roleConsumedQuota` will be off: > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L1703-L1705 > Specifically, in this formula: > `Consumed Quota = reservations + allocation - allocated reservations` > The `reservations` part does not account subrole's reservation to its > ancestors. If a reservation is made directly for role "a/b", its reservation > is accounted only for "a/b" but not for "a". Similarly, if a top role ( "a") > reservation is refined to a subrole ("a/b"), the current code first subtracts > the reservation from "a" and then track that under "a/b". > We should make it hierarchical-aware. > The "allocation" and "allocated reservations" are both tracked in the sorter > where the hierarchical relationship is considered -- allocations are added > hierarchically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9691) Quota headroom calculation is off when subroles are involved.
[ https://issues.apache.org/jira/browse/MESOS-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809875#comment-16809875 ] Benjamin Mahler commented on MESOS-9691: Re-opening as there is an issue with the fix. > Quota headroom calculation is off when subroles are involved. > - > > Key: MESOS-9691 > URL: https://issues.apache.org/jira/browse/MESOS-9691 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.8.0 > > > Quota "availableHeadroom" calculation: > https://github.com/apache/mesos/blob/6276f7e73b0dbe7df49a7315cd1b83340d66f4ea/src/master/allocator/mesos/hierarchical.cpp#L1751-L1754 > is off when subroles are involved. > Specifically, in the formula > {noformat} > available headroom = total resources - allocated resources - (total > reservations - allocated reservations) - unallocated revocable resources > {noformat} > -The "allocated resources" part is hierarchical-aware and aggregate that > across all roles, thus allocations to subroles will be counted multiple times > (in the case of "a/b", once for "a" and once for "a/b").- Looks like due to > the presence of `INTERNAL` node, > `roleSorter->allocationScalarQuantities(role)` is *not* hierarchical. Thus > this is not an issue. > (If role `a/b` consumes 1cpu and `a` consumes 1cpu, if we query > `roleSorter->allocationScalarQuantities("a");` It will return 1cpu, which is > correct. In the sorter, there are four nodes, root, `a` (internal, 1cpu), > `a/.` (leaf, 1cpu), `a/b` (leaf, 1cpu). Query `a` will return `a/.`) > The "total reservations" is correct, since today it is "flat" (reservations > made to "a/b" are not counted to "a"). Thus all reservations are only counted > once -- which is the correct semantic here. 
However, once we fix MESOS-9688 > (which likely requires reservation tracking to be hierarchical-aware), we > need to ensure that the accounting is still correct. > -The "allocated reservations" is hierarchical-aware, thus overlap accounting > would occur.- Similar to the `"allocated resources"` above, this is also not > an issue at the moment. > Basically, when calculating the available headroom, we need to ensure > "single-counting". Ideally, we only need to look at the root's consumptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-9688) Quota is not enforced properly when subroles have reservations.
[ https://issues.apache.org/jira/browse/MESOS-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16809876#comment-16809876 ] Benjamin Mahler commented on MESOS-9688: Re-opening as there is an issue with the fix. > Quota is not enforced properly when subroles have reservations. > --- > > Key: MESOS-9688 > URL: https://issues.apache.org/jira/browse/MESOS-9688 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > Fix For: 1.8.0 > > > Note: the discussion here concerns quota enforcement for top-level role, > setting quota on sublevel role is not supported. > If a subrole directly makes a reservation, the accounting of > `roleConsumedQuota` will be off: > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L1703-L1705 > Specifically, in this formula: > `Consumed Quota = reservations + allocation - allocated reservations` > The `reservations` part does not account subrole's reservation to its > ancestors. If a reservation is made directly for role "a/b", its reservation > is accounted only for "a/b" but not for "a". Similarly, if a top role ( "a") > reservation is refined to a subrole ("a/b"), the current code first subtracts > the reservation from "a" and then track that under "a/b". > We should make it hierarchical-aware. > The "allocation" and "allocated reservations" are both tracked in the sorter > where the hierarchical relationship is considered -- allocations are added > hierarchically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9696) Test MasterQuotaTest.AvailableResourcesSingleDisconnectedAgent is flaky
[ https://issues.apache.org/jira/browse/MESOS-9696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9696: -- Assignee: Benjamin Mahler > Test MasterQuotaTest.AvailableResourcesSingleDisconnectedAgent is flaky > --- > > Key: MESOS-9696 > URL: https://issues.apache.org/jira/browse/MESOS-9696 > Project: Mesos > Issue Type: Bug > Components: test >Affects Versions: 1.8.0 >Reporter: Benjamin Bannier >Assignee: Benjamin Mahler >Priority: Major > Labels: flaky, flaky-test, resource-management > Attachments: test.log > > > The test {{MasterQuotaTest.AvailableResourcesSingleDisconnectedAgent}} is > flaky, especially under additional system load. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9691) Quota headroom calculation is off when subroles are involved.
[ https://issues.apache.org/jira/browse/MESOS-9691?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9691: -- Assignee: Benjamin Mahler > Quota headroom calculation is off when subroles are involved. > - > > Key: MESOS-9691 > URL: https://issues.apache.org/jira/browse/MESOS-9691 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > > Quota "availableHeadroom" calculation: > https://github.com/apache/mesos/blob/6276f7e73b0dbe7df49a7315cd1b83340d66f4ea/src/master/allocator/mesos/hierarchical.cpp#L1751-L1754 > is off when subroles are involved. > Specifically, in the formula > {noformat} > available headroom = total resources - allocated resources - (total > reservations - allocated reservations) - unallocated revocable resources > {noformat} > The "allocated resources" part is hierarchical-aware and aggregate that > across all roles, thus allocations to subroles will be counted multiple times > (in the case of "a/b", once for "a" and once for "a/b"). > The "total reservations" is correct, since today it is "flat" (reservations > made to "a/b" are not counted to "a"). Thus all reservations are only counted > once -- which is the correct semantic here. However, once we fix MESOS-9688 > (which likely requires reservation tracking to be hierarchical-aware), we > need to ensure that the accounting is still correct. > The "allocated reservations" is hierarchical-aware, thus overlap accounting > would occur. > Basically, when calculating the available headroom, we need to ensure > "single-counting". Ideally, we only need to look at the root's consumptions. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
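The headroom formula quoted in the ticket is simple arithmetic once each term is counted exactly once; the ticket's concern is precisely that subrole allocations and reservations must not be double-counted across levels of the hierarchy. A scalar sketch of the formula itself, with cpus standing in for full resource quantities:

```cpp
#include <cassert>

// available headroom = total resources - allocated resources
//                      - (total reservations - allocated reservations)
//                      - unallocated revocable resources
// Each input must already be de-duplicated across the role hierarchy
// (i.e. "a/b" usage counted once, not once for "a/b" and once for "a").
double availableHeadroom(
    double total,
    double allocated,
    double totalReservations,
    double allocatedReservations,
    double unallocatedRevocable)
{
  return total - allocated -
         (totalReservations - allocatedReservations) -
         unallocatedRevocable;
}
```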
[jira] [Assigned] (MESOS-9688) Quota is not enforced properly when subroles have reservations.
[ https://issues.apache.org/jira/browse/MESOS-9688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9688: -- Assignee: Benjamin Mahler > Quota is not enforced properly when subroles have reservations. > --- > > Key: MESOS-9688 > URL: https://issues.apache.org/jira/browse/MESOS-9688 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Meng Zhu >Assignee: Benjamin Mahler >Priority: Critical > Labels: resource-management > > Note: the discussion here concerns quota enforcement for top-level role, > setting quota on sublevel role is not supported. > If a subrole directly makes a reservation, the accounting of > `roleConsumedQuota` will be off: > https://github.com/apache/mesos/blob/master/src/master/allocator/mesos/hierarchical.cpp#L1703-L1705 > Specifically, in this formula: > `Consumed Quota = reservations + allocation - allocated reservations` > The `reservations` part does not account subrole's reservation to its > ancestors. If a reservation is made directly for role "a/b", its reservation > is accounted only for "a/b" but not for "a". Similarly, if a top role ( "a") > reservation is refined to a subrole ("a/b"), the current code first subtracts > the reservation from "a" and then track that under "a/b". > We should make it hierarchical-aware. > The "allocation" and "allocated reservations" are both tracked in the sorter > where the hierarchical relationship is considered -- allocations are added > hierarchically. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
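Making the `reservations` term hierarchy-aware, as the ticket suggests, means charging a reservation for "a/b" both to "a/b" and to every ancestor role ("a"), so top-level quota enforcement sees it. A hedged sketch of that accounting (illustrative code, not the allocator's real tracking structures):

```cpp
#include <cassert>
#include <map>
#include <string>

// Charge `cpus` of reservation to `role` and every ancestor, splitting
// the role path on '/'. E.g. "a/b" charges both "a" and "a/b".
void chargeReservation(
    std::map<std::string, double>& reservedCpus,
    const std::string& role,
    double cpus)
{
  for (size_t i = 0; i <= role.size(); ++i) {
    if (i == role.size() || role[i] == '/') {
      reservedCpus[role.substr(0, i)] += cpus;
    }
  }
}
```

The inverse (un-charging on unreserve, and moving charges on reservation refinement from "a" to "a/b") would need the same ancestor walk to keep the ledger consistent.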
[jira] [Created] (MESOS-9689) Migrate stout hashmap and hashset to Abseil's "swiss tables".
Benjamin Mahler created MESOS-9689: -- Summary: Migrate stout hashmap and hashset to Abseil's "swiss tables". Key: MESOS-9689 URL: https://issues.apache.org/jira/browse/MESOS-9689 Project: Mesos Issue Type: Improvement Components: stout Reporter: Benjamin Mahler For improved lookup and insertion performance, as well as lower memory consumption, we should migrate stout's hashmap / hashset wrappers to use Abseil's containers. There are some subtleties to migration, see: https://abseil.io/docs/cpp/guides/container See also: https://youtu.be/ncHmEUmJZf4 -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9680) Remove automatic disablement of GLOG_drop_log_memory.
Benjamin Mahler created MESOS-9680: -- Summary: Remove automatic disablement of GLOG_drop_log_memory. Key: MESOS-9680 URL: https://issues.apache.org/jira/browse/MESOS-9680 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Once we upgrade to glog 0.4.0, we no longer need our special-case disablement of GLOG_drop_log_memory (see MESOS-920): https://github.com/apache/mesos/blob/1.7.2/src/logging/logging.cpp#L184-L194 This is because 0.4.0 includes https://github.com/google/glog/pull/145 which fixes the issue we filed: https://github.com/google/glog/issues/84. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8248) Expose information about GPU assigned to a task
[ https://issues.apache.org/jira/browse/MESOS-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798495#comment-16798495 ] Benjamin Mahler commented on MESOS-8248: [~jomach] also, let's use MESOS-5255 > Expose information about GPU assigned to a task > --- > > Key: MESOS-8248 > URL: https://issues.apache.org/jira/browse/MESOS-8248 > Project: Mesos > Issue Type: Improvement > Components: containerization, gpu >Reporter: Karthik Anantha Padmanabhan >Priority: Major > Labels: GPU > > As a framework author I'd like information about the GPU that was assigned to > a task. > `nvidia-smi`, for example, provides the following information: GPU UUID, board ID, > minor number, etc. It would be useful to expose this information when a task is > assigned to a GPU instance. > This will make it possible to monitor resource usage for a task on a GPU, which > is not possible when -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-8248) Expose information about GPU assigned to a task
[ https://issues.apache.org/jira/browse/MESOS-8248?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16798494#comment-16798494 ] Benjamin Mahler commented on MESOS-8248: [~jomach] send an email to the dev@ mailing list with your proposal; feel free to also use the #containerizer slack channel. > Expose information about GPU assigned to a task > --- > > Key: MESOS-8248 > URL: https://issues.apache.org/jira/browse/MESOS-8248 > Project: Mesos > Issue Type: Improvement > Components: containerization, gpu >Reporter: Karthik Anantha Padmanabhan >Priority: Major > Labels: GPU > > As a framework author I'd like information about the GPU that was assigned to > a task. > `nvidia-smi`, for example, provides the following information: GPU UUID, board ID, > minor number, etc. It would be useful to expose this information when a task is > assigned to a GPU instance. > This will make it possible to monitor resource usage for a task on a GPU, which > is not possible when -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9292) Rejected quota request error messages should specify which resources were overcommitted.
[ https://issues.apache.org/jira/browse/MESOS-9292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9292: -- Assignee: Benjamin Mahler Sprint: Resource Mgmt RI12 Sp 42 > Rejected quota request error messages should specify which resources were > overcommitted. > - > > Key: MESOS-9292 > URL: https://issues.apache.org/jira/browse/MESOS-9292 > Project: Mesos > Issue Type: Improvement > Components: allocation >Reporter: Benno Evers >Assignee: Benjamin Mahler >Priority: Major > Labels: multitenancy > > If we reject a quota request due to not having enough available resources, we > fail with the following error: > {noformat} > Not enough available cluster capacity to reasonably satisfy quota > request; the force flag can be used to override this check > {noformat} > but we don't print *which* resource was not available. This can be confusing > to operators when quota was requested for multiple resources at > once. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-7883) Quota heuristic check not accounting for mount volumes
[ https://issues.apache.org/jira/browse/MESOS-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-7883: -- Assignee: Benjamin Mahler Sprint: Resource Mgmt RI12 Sp 42 > Quota heuristic check not accounting for mount volumes > -- > > Key: MESOS-7883 > URL: https://issues.apache.org/jira/browse/MESOS-7883 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Vincent Roy >Assignee: Benjamin Mahler >Priority: Major > Labels: resource-management > > This may be expected but came as a surprise to us. We are unable to create a > quota bigger than the root disk space on slaves. > Given two clusters with the same number of slaves and root disk size, but one > that also has mount volumes, here is what the disk resources look like: > {noformat} > [root@fin-fang-foom-master-1 ~]# curl -s master.mesos:5050/state | jq > '.slaves[] .resources .disk' > 28698 > 28699 > 28698 > 28698 > 28697 > {noformat} > {noformat} > [root@hydra-master-1 ~]# curl -s master.mesos:5050/state | jq '.slaves[] > .resources .disk' > 50817 > 50817 > 50814 > 50819 > 50817 > {noformat} > In {{fin-fang-foom}}, I was able to create a quota for {{143490mb}}, which is > the total of available disk resources, root in this case, as reported by > Mesos. For {{hydra}}, I am only able to create a quota for {{143489mb}}. This > is equivalent to the total of root disks available in {{hydra}} rather than > the total available disks reported by Mesos resources, which is {{254084mb}}. > With a modified Mesos that adds logging to {{quota_handler}}, we can see that > only the {{disk(*)}} number increases in {{nonStaticClusterResources}} after > every iteration. The final iteration is {{disk(*):143489}}, which is the > maximum quota I was able to create on {{hydra}}. 
We expected that quota > heuristic check would also include resources such as > {{disk(*)[MOUNT:/dcos/volume2]:7373}} > {noformat} > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763764 > 24902 quota_handler.cpp:71] Performing capacity heuristic check for a set > quota request > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763783 > 24902 quota_handler.cpp:87] heuristic: total quota 'disk(*):143489' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763870 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763923 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763989 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764022 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):57396; cpus(*):8; mem(*):30046; disk(*)[MOUNT:/dcos/volume0]:7373; > 
disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764077 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28695; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764119 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):86091; cpus(*):12; mem(*):45069; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; >
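The surprising totals in the MESOS-7883 report can be sketched as follows. The matching rule and the resource representation here are assumptions for illustration, not Mesos's actual implementation: the idea is that a quota request is a plain `disk` scalar, while mount-volume disk carries source metadata such as `MOUNT:/dcos/volume0`, so a metadata-sensitive capacity check never counts it:

```python
# One hydra agent's disk resources as (name, source-metadata, amount),
# taken from the log output above (hypothetical representation).
agent = [
    ("disk", None, 28698),                  # root disk
    ("disk", "MOUNT:/dcos/volume0", 7373),  # mount volumes
    ("disk", "MOUNT:/dcos/volume1", 7373),
    ("disk", "MOUNT:/dcos/volume2", 7373),
]

# Metadata-sensitive total: only plain disk matches a plain quota resource,
# which reproduces the 28698-per-agent behavior the reporter observed.
strict_total = sum(v for name, src, v in agent if name == "disk" and src is None)
print(strict_total)  # 28698

# What the reporter expected: strip the disk source before totalling,
# matching the per-agent disk that /state reports.
flattened_total = sum(v for name, _, v in agent if name == "disk")
print(flattened_total)  # 50817
```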
[jira] [Commented] (MESOS-9634) Soft CPU limit for windows JobObject
[ https://issues.apache.org/jira/browse/MESOS-9634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16792077#comment-16792077 ] Benjamin Mahler commented on MESOS-9634: Linked in a related ticket, I thought we had a ticket for "burstable containers" but I can't seem to find one. > Soft CPU limit for windows JobObject > > > Key: MESOS-9634 > URL: https://issues.apache.org/jira/browse/MESOS-9634 > Project: Mesos > Issue Type: Wish > Components: allocation, containerization >Reporter: Andrei Stryia >Priority: Major > > We are using Mesos to run Windows payloads. As I see it, CPU utilization on the > slave nodes is not very good. Because of the hard cap limit, a process cannot > use more CPU resources even if there are a lot of free CPU resources at the > moment (e.g. only one task is started on the node at the moment). > I know the reason for this behavior is the > {{JOB_OBJECT_CPU_RATE_CONTROL_HARD_CAP}} control flag of the Job Object. > But what about the ability to use the {{JOB_OBJECT_CPU_RATE_CONTROL_MIN_MAX_RATE}} > control flag, where MinRate would be the limit specified in the task config while > MaxRate would be 100% CPU? This option would work the same way as cgroups/cpu > and add more elasticity. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9640) Add authorization support for `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9640: -- Assignee: Till Toenshoff > Add authorization support for `UPDATE_QUOTA` call. > -- > > Key: MESOS-9640 > URL: https://issues.apache.org/jira/browse/MESOS-9640 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Assignee: Till Toenshoff >Priority: Major > Labels: mesosphere, resource-management > > For the new `UPDATE_QUOTA` call, we need to add the corresponding > authorization support. Unfortunately, there is already an action named > `update_quotas`. We can use `update_quota_configs` instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (MESOS-9640) Add authorization support for `UPDATE_QUOTA` call.
[ https://issues.apache.org/jira/browse/MESOS-9640?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-9640: -- Assignee: (was: Meng Zhu) > Add authorization support for `UPDATE_QUOTA` call. > -- > > Key: MESOS-9640 > URL: https://issues.apache.org/jira/browse/MESOS-9640 > Project: Mesos > Issue Type: Improvement >Reporter: Meng Zhu >Priority: Major > Labels: mesosphere, resource-management > > For the new `UPDATE_QUOTA` call, we need to add the corresponding > authorization support. Unfortunately, there is already an action named > `update_quotas`. We can use `update_quota_configs` instead. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (MESOS-9618) Display quota consumption in the webui.
Benjamin Mahler created MESOS-9618: -- Summary: Display quota consumption in the webui. Key: MESOS-9618 URL: https://issues.apache.org/jira/browse/MESOS-9618 Project: Mesos Issue Type: Improvement Components: webui Reporter: Benjamin Mahler Currently, the Roles table in the webui displays allocation and quota guarantees / limits. However, quota "consumption" is different from allocation, in that reserved resources are always considered consumed against the quota. This discrepancy has led to confusion from users. One example occurred when an agent was added with a large reservation exceeding the memory quota guarantee. The user saw memory chopping in offers, and since the scheduler didn't want to use the reservation, it couldn't launch its tasks. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
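The consumption-vs-allocation distinction behind MESOS-9618 can be sketched with hypothetical numbers, using the consumed-quota formula from MESOS-9688 (`reservations + allocation - allocated reservations`):

```python
# Sketch: reserved resources count as consumed against a role's quota even
# when nothing is allocated from them, so a webui showing only allocation
# can look misleadingly far below the guarantee. (Hypothetical numbers.)

mem_guarantee = 64          # quota guarantee for the role (GB)
allocated = 10              # what the webui's allocation column shows
reservations = 60           # a large static reservation on one agent
allocated_reservations = 4  # reserved memory actually in use

# Consumed Quota = reservations + allocation - allocated reservations
consumed = reservations + allocated - allocated_reservations
print(consumed)  # 66 -- already over the 64 GB guarantee

# Allocation alone (10) suggests plenty of headroom; consumption (66)
# explains why the scheduler only sees chopped-up memory in offers.
```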
[jira] [Assigned] (MESOS-6840) Tests for quota capacity heuristic.
[ https://issues.apache.org/jira/browse/MESOS-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-6840: -- Assignee: (was: Zhitao Li) > Tests for quota capacity heuristic. > --- > > Key: MESOS-6840 > URL: https://issues.apache.org/jira/browse/MESOS-6840 > Project: Mesos > Issue Type: Task > Components: allocation, test >Reporter: Alexander Rukletsov >Priority: Major > Labels: mesosphere, quota, resource-management > > We need more tests to ensure capacity heuristic works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-7883) Quota heuristic check not accounting for mount volumes
[ https://issues.apache.org/jira/browse/MESOS-7883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779689#comment-16779689 ] Benjamin Mahler commented on MESOS-7883: Linking in quota "capacity heuristic" testing work. > Quota heuristic check not accounting for mount volumes > -- > > Key: MESOS-7883 > URL: https://issues.apache.org/jira/browse/MESOS-7883 > Project: Mesos > Issue Type: Bug > Components: allocation >Reporter: Vincent Roy >Priority: Major > Labels: resource-management > > This may be expected but came as a surprise to us. We are unable to create a > quota bigger than the root disk space on slaves. > Given two clusters with the same number of slaves and root disk size, but one > that also has mount volumes, here is what the disk resources look like: > {noformat} > [root@fin-fang-foom-master-1 ~]# curl -s master.mesos:5050/state | jq > '.slaves[] .resources .disk' > 28698 > 28699 > 28698 > 28698 > 28697 > {noformat} > {noformat} > [root@hydra-master-1 ~]# curl -s master.mesos:5050/state | jq '.slaves[] > .resources .disk' > 50817 > 50817 > 50814 > 50819 > 50817 > {noformat} > In {{fin-fang-foom}}, I was able to create a quota for {{143490mb}}, which is > the total of available disk resources, root in this case, as reported by > Mesos. For {{hydra}}, I am only able to create a quota for {{143489mb}}. This > is equivalent to the total of root disks available in {{hydra}} rather than > the total available disks reported by Mesos resources, which is {{254084mb}}. > With a modified Mesos that adds logging to {{quota_handler}}, we can see that > only the {{disk(*)}} number increases in {{nonStaticClusterResources}} after > every iteration. The final iteration is {{disk(*):143489}}, which is the > maximum quota I was able to create on {{hydra}}. 
We expected that quota > heuristic check would also include resources such as > {{disk(*)[MOUNT:/dcos/volume2]:7373}} > {noformat} > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763764 > 24902 quota_handler.cpp:71] Performing capacity heuristic check for a set > quota request > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763783 > 24902 quota_handler.cpp:87] heuristic: total quota 'disk(*):143489' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763870 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763923 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.763989 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28698; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764022 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):57396; cpus(*):8; mem(*):30046; disk(*)[MOUNT:/dcos/volume0]:7373; > 
disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764077 > 24902 quota_handler.cpp:111] heuristic: nonStaticAgentResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):28695; cpus(*):4; mem(*):15023' > Aug 11 12:54:18 hydra-master-1 mesos-master[24896]: I0811 12:54:18.764119 > 24902 quota_handler.cpp:113] heuristic: nonStaticClusterResources = > 'ports(*):[1025-2180, 2182-3887, 3889-5049, 5052-8079, 8082-8180, > 8182-32000]; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; > disk(*):86091; cpus(*):12; mem(*):45069; disk(*)[MOUNT:/dcos/volume0]:7373; > disk(*)[MOUNT:/dcos/volume1]:7373; disk(*)[MOUNT:/dcos/volume2]:7373; >
[jira] [Assigned] (MESOS-6840) Tests for quota capacity heuristic.
[ https://issues.apache.org/jira/browse/MESOS-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Benjamin Mahler reassigned MESOS-6840: -- Shepherd: (was: Alexander Rukletsov) Assignee: Benjamin Mahler Sprint: Resource Mgmt RI11 Sp 41 > Tests for quota capacity heuristic. > --- > > Key: MESOS-6840 > URL: https://issues.apache.org/jira/browse/MESOS-6840 > Project: Mesos > Issue Type: Task > Components: allocation, test >Reporter: Alexander Rukletsov >Assignee: Benjamin Mahler >Priority: Major > Labels: mesosphere, quota, resource-management > > We need more tests to ensure capacity heuristic works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (MESOS-6840) Tests for quota capacity heuristic.
[ https://issues.apache.org/jira/browse/MESOS-6840?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16779681#comment-16779681 ] Benjamin Mahler commented on MESOS-6840: As part of testing the capacity heuristic, we'd like to refactor the code to make it unit-testable. > Tests for quota capacity heuristic. > --- > > Key: MESOS-6840 > URL: https://issues.apache.org/jira/browse/MESOS-6840 > Project: Mesos > Issue Type: Task > Components: allocation, test >Reporter: Alexander Rukletsov >Priority: Major > Labels: mesosphere, quota, resource-management > > We need more tests to ensure capacity heuristic works as expected. -- This message was sent by Atlassian JIRA (v7.6.3#76005)