[jira] [Commented] (MESOS-6162) Add support for cgroups blkio subsystem blkio statistics.

2017-07-31 Thread Jason Lai (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108168#comment-16108168
 ] 

Jason Lai commented on MESOS-6162:
--

Thanks a lot for shepherding my changes while I was away on vacation, 
[~gilbert]! And no worries! Glad to get my code committed upstream, 
regardless of the committer name. We still have a lot of tasks to collaborate 
on going forward :)

> Add support for cgroups blkio subsystem blkio statistics.
> -
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>  Components: cgroups, containerization
>Reporter: haosdent
>Assignee: Jason Lai
>  Labels: cgroups, containerizer, mesosphere
> Fix For: 1.4.0
>
>
> Note that the cgroups blkio subsystem may have a performance issue; refer to 
> https://github.com/opencontainers/runc/issues/861



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (MESOS-6162) Add support for cgroups blkio subsystem blkio statistics.

2017-07-31 Thread Gilbert Song (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108152#comment-16108152
 ] 

Gilbert Song commented on MESOS-6162:
-

[~jasonlai], sorry I forgot to update the commits to be under your name. I made 
some changes, and it was entirely my mistake. Apologies for that.

/cc [~ctrlhxj] [~zhitao]

> Add support for cgroups blkio subsystem blkio statistics.
> -
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>  Components: cgroups, containerization
>Reporter: haosdent
>Assignee: Jason Lai
>  Labels: cgroups, containerizer, mesosphere
> Fix For: 1.4.0
>
>
> Note that the cgroups blkio subsystem may have a performance issue; refer to 
> https://github.com/opencontainers/runc/issues/861





[jira] [Updated] (MESOS-6162) Add support for cgroups blkio subsystem blkio statistics.

2017-07-31 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6162:

   Sprint: Mesosphere Sprint 60
 Story Points: 8
   Labels: cgroups containerizer mesosphere  (was: )
Fix Version/s: 1.4.0
  Component/s: containerization
   cgroups

> Add support for cgroups blkio subsystem blkio statistics.
> -
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>  Components: cgroups, containerization
>Reporter: haosdent
>Assignee: Jason Lai
>  Labels: cgroups, containerizer, mesosphere
> Fix For: 1.4.0
>
>
> Note that the cgroups blkio subsystem may have a performance issue; refer to 
> https://github.com/opencontainers/runc/issues/861





[jira] [Created] (MESOS-7843) Support blkio control in cgroups blkio subsystem.

2017-07-31 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-7843:
---

 Summary: Support blkio control in cgroups blkio subsystem.
 Key: MESOS-7843
 URL: https://issues.apache.org/jira/browse/MESOS-7843
 Project: Mesos
  Issue Type: Task
  Components: cgroups, containerization
Reporter: Gilbert Song
Priority: Minor


We now support blkio statistics in the cgroups blkio subsystem: we read the 
blkio stats files in cgroups and expose them in resource statistics. We may 
want to support blkio control functionality in BlkioSubsystem::prepare().
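For reference, the statistics support reads stat files in the cgroups v1 blkio 
format, where each line is "<major>:<minor> <operation> <value>" with a 
trailing "Total <value>" aggregate. A minimal parsing sketch (the key layout 
is illustrative, not Mesos' actual code):

```cpp
#include <cstdint>
#include <map>
#include <sstream>
#include <string>

// Parse a cgroups v1 blkio stat file (e.g. blkio.throttle.io_serviced).
// Per-device lines are stored under "<major>:<minor>/<operation>"; the
// trailing aggregate is stored under "Total".
std::map<std::string, uint64_t> parseBlkioStat(const std::string& contents)
{
  std::map<std::string, uint64_t> result;
  std::istringstream in(contents);
  std::string line;
  while (std::getline(in, line)) {
    std::istringstream fields(line);
    std::string first, second, third;
    fields >> first >> second >> third;
    if (third.empty()) {
      // Aggregate line: "Total <value>".
      if (first == "Total" && !second.empty()) {
        result["Total"] = std::stoull(second);
      }
    } else {
      // Per-device line: "<major>:<minor> <operation> <value>".
      result[first + "/" + second] = std::stoull(third);
    }
  }
  return result;
}
```

A control-side prepare() would write limits (e.g. blkio.throttle.read_bps_device) rather than read stats, but the file format conventions are the same.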





[jira] [Updated] (MESOS-6402) Add rlimit support to Mesos containerizer

2017-07-31 Thread Adam B (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6402?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-6402:
--
Epic Status: Done  (was: To Do)

> Add rlimit support to Mesos containerizer
> -
>
> Key: MESOS-6402
> URL: https://issues.apache.org/jira/browse/MESOS-6402
> Project: Mesos
>  Issue Type: Epic
>  Components: containerization
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
> Fix For: 1.2.0
>
>
> We should allow containers to expose their rlimit requirements so that their 
> environment can be set up via Mesos abstractions.





[jira] [Updated] (MESOS-6162) Add support for cgroups blkio subsystem blkio statistics.

2017-07-31 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-6162:

Summary: Add support for cgroups blkio subsystem blkio statistics.  (was: 
Add support for cgroups blkio subsystem)

> Add support for cgroups blkio subsystem blkio statistics.
> -
>
> Key: MESOS-6162
> URL: https://issues.apache.org/jira/browse/MESOS-6162
> Project: Mesos
>  Issue Type: Task
>Reporter: haosdent
>Assignee: Jason Lai
>
> Note that the cgroups blkio subsystem may have a performance issue; refer to 
> https://github.com/opencontainers/runc/issues/861





[jira] [Commented] (MESOS-6390) Ensure Python support scripts are linted

2017-07-31 Thread Armand Grillet (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-6390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16108025#comment-16108025
 ] 

Armand Grillet commented on MESOS-6390:
---

We are nearly there:
https://reviews.apache.org/r/60235/
https://reviews.apache.org/r/60900/

> Ensure Python support scripts are linted
> 
>
> Key: MESOS-6390
> URL: https://issues.apache.org/jira/browse/MESOS-6390
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Benjamin Bannier
>Assignee: Armand Grillet
>  Labels: newbie, python
>
> Currently {{support/mesos-style.py}} does not lint files under {{support/}}. 
> This is mostly because these scripts are so inconsistent style-wise that 
> they wouldn't even pass the linter now.
> We should clean up all Python scripts under {{support/}} so they pass the 
> Python linter, and activate that directory in the linter for future 
> additions. 





[jira] [Commented] (MESOS-4941) Support update existing quota.

2017-07-31 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4941?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107934#comment-16107934
 ] 

Benjamin Mahler commented on MESOS-4941:


[~alexr] [~zhitao] are you guys still working on this? Should I be looking at 
these reviews or would it be better to take over the changes?

> Support update existing quota.
> --
>
> Key: MESOS-4941
> URL: https://issues.apache.org/jira/browse/MESOS-4941
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Zhitao Li
>Assignee: Zhitao Li
>  Labels: Quota, mesosphere, multitenancy, tech-debt
>
> We want to support updating an existing quota without the cycle of delete and 
> recreate. This avoids the possible starvation risk of losing the quota 
> between delete and recreate, and also makes the interface friendlier.
> Design doc:
> https://docs.google.com/document/d/1c8fJY9_N0W04FtUQ_b_kZM6S0eePU7eYVyfUP14dSys





[jira] [Updated] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.

2017-07-31 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7258:
---
Labels: multitenancy  (was: )

> Provide scheduler calls to subscribe to additional roles and unsubscribe from 
> roles.
> 
>
> Key: MESOS-7258
> URL: https://issues.apache.org/jira/browse/MESOS-7258
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, scheduler api
>Reporter: Benjamin Mahler
>  Labels: multitenancy
>
> The current support for schedulers to subscribe to additional roles or 
> unsubscribe from some of their roles requires that the scheduler obtain a new 
> subscription with the master which invalidates the event stream.
> A more lightweight mechanism would be to provide calls for the scheduler to 
> subscribe to additional roles or unsubscribe from some roles such that the 
> existing event stream remains open and offers to the new roles arrive on the 
> existing event stream. E.g.
> SUBSCRIBE_TO_ROLE
> UNSUBSCRIBE_FROM_ROLE
> One open question pertains to the terminology here, whether we would want to 
> avoid using "subscribe" in this context. An alternative would be:
> UPDATE_FRAMEWORK_INFO
> Which provides a generic mechanism for a framework to perform framework info 
> updates without obtaining a new event stream.





[jira] [Updated] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.

2017-07-31 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7258:
---
Component/s: (was: multitenancy)

> Provide scheduler calls to subscribe to additional roles and unsubscribe from 
> roles.
> 
>
> Key: MESOS-7258
> URL: https://issues.apache.org/jira/browse/MESOS-7258
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, scheduler api
>Reporter: Benjamin Mahler
>  Labels: multitenancy
>
> The current support for schedulers to subscribe to additional roles or 
> unsubscribe from some of their roles requires that the scheduler obtain a new 
> subscription with the master which invalidates the event stream.
> A more lightweight mechanism would be to provide calls for the scheduler to 
> subscribe to additional roles or unsubscribe from some roles such that the 
> existing event stream remains open and offers to the new roles arrive on the 
> existing event stream. E.g.
> SUBSCRIBE_TO_ROLE
> UNSUBSCRIBE_FROM_ROLE
> One open question pertains to the terminology here, whether we would want to 
> avoid using "subscribe" in this context. An alternative would be:
> UPDATE_FRAMEWORK_INFO
> Which provides a generic mechanism for a framework to perform framework info 
> updates without obtaining a new event stream.





[jira] [Updated] (MESOS-7258) Provide scheduler calls to subscribe to additional roles and unsubscribe from roles.

2017-07-31 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7258:
---
Component/s: multitenancy

> Provide scheduler calls to subscribe to additional roles and unsubscribe from 
> roles.
> 
>
> Key: MESOS-7258
> URL: https://issues.apache.org/jira/browse/MESOS-7258
> Project: Mesos
>  Issue Type: Improvement
>  Components: master, scheduler api
>Reporter: Benjamin Mahler
>  Labels: multitenancy
>
> The current support for schedulers to subscribe to additional roles or 
> unsubscribe from some of their roles requires that the scheduler obtain a new 
> subscription with the master which invalidates the event stream.
> A more lightweight mechanism would be to provide calls for the scheduler to 
> subscribe to additional roles or unsubscribe from some roles such that the 
> existing event stream remains open and offers to the new roles arrive on the 
> existing event stream. E.g.
> SUBSCRIBE_TO_ROLE
> UNSUBSCRIBE_FROM_ROLE
> One open question pertains to the terminology here, whether we would want to 
> avoid using "subscribe" in this context. An alternative would be:
> UPDATE_FRAMEWORK_INFO
> Which provides a generic mechanism for a framework to perform framework info 
> updates without obtaining a new event stream.





[jira] [Updated] (MESOS-3338) Dynamic reservations are not counted as used resources in the master

2017-07-31 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-3338:
---
Labels: mesosphere multitenancy persistent-volumes  (was: mesosphere 
persistent-volumes)

> Dynamic reservations are not counted as used resources in the master
> 
>
> Key: MESOS-3338
> URL: https://issues.apache.org/jira/browse/MESOS-3338
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation, master
>Reporter: Alexander Rukletsov
>Assignee: Guangya Liu
>Priority: Minor
>  Labels: mesosphere, multitenancy, persistent-volumes
>
> Dynamically reserved resources should be considered used or allocated and 
> hence reflected in Mesos bookkeeping structures and {{state.json}}.
> I expanded the {{ReservationTest.ReserveThenUnreserve}} test with the 
> following section:
> {code}
>   // Check that the Master counts the reservation as a used resource.
>   {
>     Future<process::http::Response> response =
>       process::http::get(master.get(), "state.json");
>     AWAIT_READY(response);
>     Try<JSON::Object> parse = JSON::parse<JSON::Object>(response.get().body);
>     ASSERT_SOME(parse);
>     Result<JSON::Number> cpus =
>       parse.get().find<JSON::Number>("slaves[0].used_resources.cpus");
>     ASSERT_SOME_EQ(JSON::Number(1), cpus);
>   }
> {code}
> and got
> {noformat}
> ../../../src/tests/reservation_tests.cpp:168: Failure
> Value of: (cpus).get()
>   Actual: 0
> Expected: JSON::Number(1)
> Which is: 1
> {noformat}
> Idea for new resources states: 
> https://docs.google.com/drawings/d/1aquVIqPY8D_MR-cQjZu-wz5nNn3cYP3jXqegUHl-Kzc/edit





[jira] [Updated] (MESOS-7099) Quota can be exceeded due to coarse-grained offer technique.

2017-07-31 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-7099:
---
Labels: multitenancy  (was: )

> Quota can be exceeded due to coarse-grained offer technique.
> 
>
> Key: MESOS-7099
> URL: https://issues.apache.org/jira/browse/MESOS-7099
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: Benjamin Mahler
>Priority: Critical
>  Labels: multitenancy
>
> The current implementation of quota allocation allocates the entire available 
> resources on an agent when trying to satisfy the quota. What this means is 
> that quota can be exceeded by the size of an agent.
> This is especially problematic for large machines, consider a 48 core, 512 GB 
> memory server where a role is given 4 cores and 4GB of memory. Given our 
> current approach, we will send an offer for the entire 48 cores and 512 GB of 
> memory!
> This ticket is to perform fine grained offers when the allocation will exceed 
> the quota.
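The fine-grained offer idea above can be sketched as follows, modeling 
resources as a simple name-to-amount map; this is an illustration of the 
clamping logic, not Mesos' allocator code:

```cpp
#include <algorithm>
#include <map>
#include <string>

// Clamp an agent's available resources so an offer never pushes a role past
// its remaining quota: grant min(available, remaining quota) per resource.
std::map<std::string, double> clampToQuota(
    const std::map<std::string, double>& available,
    const std::map<std::string, double>& remainingQuota)
{
  std::map<std::string, double> offer;
  for (const auto& entry : available) {
    auto it = remainingQuota.find(entry.first);
    // Resources with no remaining quota are not offered at all.
    double limit = (it == remainingQuota.end()) ? 0.0 : it->second;
    double granted = std::min(entry.second, limit);
    if (granted > 0.0) {
      offer[entry.first] = granted;
    }
  }
  return offer;
}
```

With the 48-core, 512 GB agent and a 4-CPU, 4 GB quota from the description, this grants 4 CPUs and 4 GB instead of the whole agent.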





[jira] [Commented] (MESOS-7828) Current approach to parse protobuf enum from JSON does not support upgrades

2017-07-31 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7828?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107853#comment-16107853
 ] 

Benjamin Mahler commented on MESOS-7828:


[~qianzhang] do you know how the protobuf-supplied JSON libraries address this 
issue? It seems like they don't. Or does the 'ignore unknown fields' parameter 
they provide also apply to unknown enum values? I would be curious to know what 
they did in proto3, since they wrote this in the release notes:

{quote}
iv. Fix semantics for unknown enum values.
{quote}

https://github.com/google/protobuf/releases/tag/v3.0.0-alpha-1


> Current approach to parse protobuf enum from JSON does not support upgrades
> ---
>
> Key: MESOS-7828
> URL: https://issues.apache.org/jira/browse/MESOS-7828
> Project: Mesos
>  Issue Type: Bug
>Reporter: Qian Zhang
>Assignee: Qian Zhang
>
> To use protobuf enum in a backwards compatible way, [the suggestion on the 
> protobuf mailing 
> list|https://groups.google.com/forum/#!msg/protobuf/NhUjBfDyGmY/pf294zMi2bIJ] 
> is to use optional enum fields and include an UNKNOWN value as the first 
> entry in the enum list (and/or explicitly specifying it as the default). This 
> can handle the case of parsing protobuf message from a serialized string, but 
> it can not handle the case of parsing protobuf message from JSON.
> E.g., when I access the master endpoint with a nonexistent enum value 
> {{xxx}}, I will get an error:
> {code}
> $ curl -X POST -H "Content-Type: application/json" -d '{"type": "xxx"}' 
> 127.0.0.1:5050/api/v1
> Failed to convert JSON into Call protobuf: Failed to find enum for 'xxx'% 
> {code}
> In the {{Call}} protobuf message, the enum {{Type}} already has a default 
> value {{UNKNOWN}} (see 
> [here|https://github.com/apache/mesos/blob/1.3.0/include/mesos/v1/master/master.proto#L45]
>  for details) and the field {{Call.type}} is optional, but the above curl 
> command will still fail. The root cause is, in the code 
> [here|https://github.com/apache/mesos/blob/1.3.0/3rdparty/stout/include/stout/protobuf.hpp#L449:L454]
>  when we try to get the enum value for the string "xxx", it will fail since 
> there is no enum value corresponding to "xxx".
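The upgrade-tolerant behavior being asked for can be sketched as follows; the 
enum and the names are hypothetical, not the actual {{Call::Type}} values:

```cpp
#include <map>
#include <string>

// UNKNOWN is the first/default entry, per the protobuf mailing-list guidance
// cited in the description.
enum class CallType { UNKNOWN, GET_HEALTH, GET_STATE };

// Parse an enum name from JSON: an unrecognized string degrades to UNKNOWN
// instead of failing the whole request, mirroring proto3's handling of
// unknown enum values.
CallType parseCallType(const std::string& name)
{
  static const std::map<std::string, CallType> known = {
    {"GET_HEALTH", CallType::GET_HEALTH},
    {"GET_STATE", CallType::GET_STATE},
  };
  auto it = known.find(name);
  return it == known.end() ? CallType::UNKNOWN : it->second;
}
```

The caller can then reject UNKNOWN with a well-formed error response instead of failing the JSON-to-protobuf conversion outright.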





[jira] [Commented] (MESOS-7714) Fix agent downgrade for reservation refinement

2017-07-31 Thread Yan Xu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107680#comment-16107680
 ] 

Yan Xu commented on MESOS-7714:
---

Yes we are. Thanks! Hope we can prioritize this one (possibly over other 1.4 
blockers) so we can promote dev versions of 1.4 further for more thorough 
testing.

> Fix agent downgrade for reservation refinement
> --
>
> Key: MESOS-7714
> URL: https://issues.apache.org/jira/browse/MESOS-7714
> Project: Mesos
>  Issue Type: Bug
>Reporter: Michael Park
>Priority: Blocker
>
> The agent code only partially supports downgrading an agent correctly.
> The checkpointed resources are handled correctly, but the resources within
> the {{SlaveInfo}} message, as well as tasks and executors, also need to be 
> downgraded correctly and converted back on recovery.





[jira] [Updated] (MESOS-7695) Add heartbeats to master stream API

2017-07-31 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-7695:
--
Shepherd: Anand Mazumdar  (was: Vinod Kone)
  Sprint: Mesosphere Sprint 60

> Add heartbeats to master stream API
> ---
>
> Key: MESOS-7695
> URL: https://issues.apache.org/jira/browse/MESOS-7695
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Vinod Kone
>Assignee: Quinn
>  Labels: newbie++
>
> Just like master uses heartbeats for scheduler API to keep the connection 
> alive, it should do the same for the streaming API.





[jira] [Commented] (MESOS-7695) Add heartbeats to master stream API

2017-07-31 Thread Quinn (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107654#comment-16107654
 ] 

Quinn commented on MESOS-7695:
--

https://reviews.apache.org/r/61262/

> Add heartbeats to master stream API
> ---
>
> Key: MESOS-7695
> URL: https://issues.apache.org/jira/browse/MESOS-7695
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Vinod Kone
>Assignee: Quinn
>  Labels: newbie++
>
> Just like master uses heartbeats for scheduler API to keep the connection 
> alive, it should do the same for the streaming API.





[jira] [Assigned] (MESOS-7695) Add heartbeats to master stream API

2017-07-31 Thread Quinn (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7695?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Quinn reassigned MESOS-7695:


Assignee: Quinn

> Add heartbeats to master stream API
> ---
>
> Key: MESOS-7695
> URL: https://issues.apache.org/jira/browse/MESOS-7695
> Project: Mesos
>  Issue Type: Improvement
>  Components: HTTP API
>Reporter: Vinod Kone
>Assignee: Quinn
>  Labels: newbie++
>
> Just like master uses heartbeats for scheduler API to keep the connection 
> alive, it should do the same for the streaming API.





[jira] [Assigned] (MESOS-6489) Better support for containers that want to manage their own cgroup.

2017-07-31 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-6489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu reassigned MESOS-6489:
-

Assignee: Yan Xu  (was: Anindya Sinha)

> Better support for containers that want to manage their own cgroup.
> ---
>
> Key: MESOS-6489
> URL: https://issues.apache.org/jira/browse/MESOS-6489
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups
>Reporter: Jie Yu
>Assignee: Yan Xu
>  Labels: cgroups
>
> Some containers want to manage their cgroup by sub-dividing the cgroup that 
> Mesos allocates to them into multiple sub-cgroups and putting subprocesses 
> into the corresponding sub-cgroups.
> For instance, someone may want to run the Docker daemon in a Mesos 
> container. The Docker daemon will manage the cgroup assigned to it by Mesos 
> (with the help of, for example, cgroup namespaces).
> Problems arise during the teardown of the container because two entities 
> might be manipulating the same cgroup simultaneously. For example, the Mesos 
> cgroups::destroy might fail if the task running inside is trying to delete 
> the same nested cgroup at the same time.
> To support that case, we should consider killing all the processes in the 
> Mesos cgroup first, making sure that no one will be creating sub-cgroups or 
> moving new processes into sub-cgroups, and then destroying the cgroups 
> recursively.
> We need the freezer because we want to make sure all processes are stopped 
> while we are sending kill signals, to avoid a TOCTTOU race. I think it makes 
> more sense to freeze the cgroups (and sub-cgroups) from the top down (rather 
> than bottom up) because, typically, processes in the parent cgroup 
> manipulate sub-cgroups.
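The ordering described above (freeze top-down, kill everything, then destroy 
bottom-up) can be sketched as follows; the plan strings and helper are 
hypothetical, and no real cgroupfs is touched:

```cpp
#include <string>
#include <vector>

// Produce a teardown plan for a cgroup subtree given as a pre-ordered list
// of paths (parents before children): freeze from the top down, signal the
// now-frozen processes, then remove cgroups from the bottom up so children
// go before parents.
std::vector<std::string> teardownPlan(const std::vector<std::string>& preorder)
{
  std::vector<std::string> plan;
  for (const auto& cg : preorder) {
    plan.push_back("freeze " + cg);   // top-down: parents first
  }
  for (const auto& cg : preorder) {
    plan.push_back("kill " + cg);     // signal while frozen (no TOCTTOU race)
  }
  for (auto it = preorder.rbegin(); it != preorder.rend(); ++it) {
    plan.push_back("rmdir " + *it);   // bottom-up: children first
  }
  return plan;
}
```

Freezing first guarantees that nothing can create new sub-cgroups or fork new processes between the kill signals and the recursive destroy.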





[jira] [Created] (MESOS-7842) Basic sandbox GC metrics.

2017-07-31 Thread James Peach (JIRA)
James Peach created MESOS-7842:
--

 Summary: Basic sandbox GC metrics.
 Key: MESOS-7842
 URL: https://issues.apache.org/jira/browse/MESOS-7842
 Project: Mesos
  Issue Type: Improvement
  Components: agent, statistics
Reporter: James Peach
Assignee: James Peach


Add some basic metrics around sandbox garbage collection. At minimum it would 
be helpful to know when GC is happening and when it is failing.





[jira] [Created] (MESOS-7841) Use values::rangesToIntervalSet() in the port_mapping isolator.

2017-07-31 Thread James Peach (JIRA)
James Peach created MESOS-7841:
--

 Summary: Use values::rangesToIntervalSet() in the port_mapping 
isolator.
 Key: MESOS-7841
 URL: https://issues.apache.org/jira/browse/MESOS-7841
 Project: Mesos
  Issue Type: Bug
  Components: agent
Reporter: James Peach
Priority: Minor


Since we now have {{values::rangesToIntervalSet()}} in common code, we can use 
this in the {{port_mapping}} isolator instead of the local function which does 
the same thing.
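The idea behind such a conversion can be sketched as follows, assuming closed 
integer ranges; this mirrors the intent of {{values::rangesToIntervalSet()}}, 
not its exact signature:

```cpp
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Convert a list of closed [begin, end] ranges into a minimal interval set
// by sorting them and coalescing overlapping or adjacent ranges.
std::vector<std::pair<uint64_t, uint64_t>> toIntervalSet(
    std::vector<std::pair<uint64_t, uint64_t>> ranges)
{
  std::sort(ranges.begin(), ranges.end());
  std::vector<std::pair<uint64_t, uint64_t>> merged;
  for (const auto& r : ranges) {
    if (!merged.empty() && r.first <= merged.back().second + 1) {
      // Overlapping or adjacent: extend the current interval.
      merged.back().second = std::max(merged.back().second, r.second);
    } else {
      merged.push_back(r);
    }
  }
  return merged;
}
```

For port ranges this turns, say, [31000-31005], [31003-31010], [31012-31015] into the two intervals [31000-31010] and [31012-31015].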





[jira] [Commented] (MESOS-7841) Use values::rangesToIntervalSet() in the port_mapping isolator.

2017-07-31 Thread James Peach (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-7841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107458#comment-16107458
 ] 

James Peach commented on MESOS-7841:


/cc [~qianzhang]

> Use values::rangesToIntervalSet() in the port_mapping isolator.
> ---
>
> Key: MESOS-7841
> URL: https://issues.apache.org/jira/browse/MESOS-7841
> Project: Mesos
>  Issue Type: Bug
>  Components: agent
>Reporter: James Peach
>Priority: Minor
>
> Since we now have {{values::rangesToIntervalSet()}} in common code, we can 
> use this in the {{port_mapping}} isolator instead of the local function which 
> does the same thing.





[jira] [Updated] (MESOS-7215) Race condition on re-registration of non-partition-aware frameworks

2017-07-31 Thread Yan Xu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yan Xu updated MESOS-7215:
--
Shepherd: Yan Xu  (was: Neil Conway)

> Race condition on re-registration of non-partition-aware frameworks
> ---
>
> Key: MESOS-7215
> URL: https://issues.apache.org/jira/browse/MESOS-7215
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 1.2.0
>Reporter: Yan Xu
>Assignee: Megha Sharma
>Priority: Critical
>
> Prior to the partition-awareness work MESOS-5344, upon agent reregistration 
> after it has been removed, the master only sends ShutdownFrameworkMessages to 
> the agent for frameworks that it knows have been torn down. 
> With the new logic in MESOS-5344, Mesos now sends 
> {{ShutdownFrameworkMessages}} to the agent for all non-partition-aware 
> frameworks (including the ones that are still registered).
> This is problematic. The offer from this agent can still go to the same 
> framework which can then launch new tasks. The agent then receives tasks of 
> the same framework and ignores them because it thinks the framework is 
> shutting down. The framework is not shutting down of course, so from the 
> master and the scheduler's perspective the task is pending in STAGING forever 
> until the next agent reregistration, which could happen much later.
> This also makes the semantics of `ShutdownFrameworkMessage` ambiguous: the 
> agent assumes the framework is going away (and acts accordingly) when it's 
> not.





[jira] [Commented] (MESOS-4812) Mesos fails to escape command health checks

2017-07-31 Thread Alexander Rukletsov (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107317#comment-16107317
 ] 

Alexander Rukletsov commented on MESOS-4812:


[~haosd...@gmail.com], could you please rebase this?

> Mesos fails to escape command health checks
> ---
>
> Key: MESOS-4812
> URL: https://issues.apache.org/jira/browse/MESOS-4812
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.25.0
>Reporter: Lukas Loesche
>Assignee: haosdent
>  Labels: health-check, mesosphere, tech-debt
> Attachments: health_task.gif
>
>
> As described in https://github.com/mesosphere/marathon/issues/
> I would like to run a command health check
> {noformat}
> /bin/bash -c " {noformat}
> The health check fails because Mesos, while running the command inside the 
> double quotes of a sh -c "", doesn't escape the double quotes in the command.
> If I escape the double quotes myself, the command health check succeeds. But 
> this would mean that the user needs intimate knowledge of how Mesos executes 
> their commands, which can't be right.
> I was told this is not a Marathon but a Mesos issue, so I am opening this 
> JIRA. I don't know if this only affects the command health check.
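A minimal sketch of the kind of escaping the report asks for, assuming the 
command is wrapped in double quotes for {{sh -c}}; this is illustrative, not 
Mesos' actual health-check code:

```cpp
#include <string>

// Wrap a user command in double quotes for `sh -c "..."`, backslash-escaping
// the characters that remain special inside double quotes: " \ $ `
std::string escapeForDoubleQuotes(const std::string& command)
{
  std::string out = "\"";
  for (char c : command) {
    if (c == '"' || c == '\\' || c == '$' || c == '`') {
      out += '\\';  // neutralize the shell metacharacter
    }
    out += c;
  }
  out += '"';
  return out;
}
```

With this, a command containing double quotes survives the round trip through sh -c without the user having to pre-escape it.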





[jira] [Assigned] (MESOS-7805) mesos-execute has incorrect example TaskInfo in help string

2017-07-31 Thread Benjamin Bannier (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7805?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Bannier reassigned MESOS-7805:
---

Assignee: Benjamin Bannier

> mesos-execute has incorrect example TaskInfo in help string
> ---
>
> Key: MESOS-7805
> URL: https://issues.apache.org/jira/browse/MESOS-7805
> Project: Mesos
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.4.0
>Reporter: Benjamin Bannier
>Assignee: Benjamin Bannier
>  Labels: mesosphere
>
> {{mesos-execute}} documents that a task can be defined via JSON as
> {noformat}
> {
>   "name": "Name of the task",
>   "task_id": {"value" : "Id of the task"},
>   "agent_id": {"value" : ""},
>   "resources": [
> {
>   "name": "cpus",
>   "type": "SCALAR",
>   "scalar": {
> "value": 0.1
>   },
>   "role": "*"
> },
> {
>   "name": "mem",
>   "type": "SCALAR",
>   "scalar": {
> "value": 32
>   },
>   "role": "*"
> }
>   ],
>   "command": {
> "value": "sleep 1000"
>   }
> }
> {noformat}
> If one actually uses that example task definition one gets
> {noformat}
> % ./build/src/mesos-execute --master=127.0.0.1:5050 --task=task.json
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> W0719 17:08:17.909696 3291313088 parse.hpp:114] Specifying an absolute 
> filename to read a command line option out of without using 'file:// is 
> deprecated and will be removed in a future release. Simply adding 'file://' 
> to the beginning of the path should eliminate this warning.
> [warn] kq_init: detected broken kqueue; not using.: Undefined error: 0
> I0719 17:08:17.919190 119246848 scheduler.cpp:184] Version: 1.4.0
> I0719 17:08:17.923991 119783424 scheduler.cpp:470] New master detected at 
> master@127.0.0.1:5050
> Subscribed with ID bb0d36b4-fee0-4412-9cd9-1fa4e330355c-
> F0719 17:08:18.137984 119783424 resources.cpp:1081] Check failed: 
> !resource.has_role()
> *** Check failure stack trace: ***
> @0x101d65f5f  google::LogMessageFatal::~LogMessageFatal()
> @0x101d62609  google::LogMessageFatal::~LogMessageFatal()
> @0x1016ef3a3  mesos::v1::Resources::isEmpty()
> @0x1016ed267  mesos::v1::Resources::add()
> @0x1016f05af  mesos::v1::Resources::operator+=()
> @0x1016f08fb  mesos::v1::Resources::Resources()
> @0x100c0d89f  CommandScheduler::offers()
> @0x100c085e4  CommandScheduler::received()
> @0x100c0ae06  
> _ZZN7process8dispatchI16CommandSchedulerNSt3__15queueIN5mesos2v19scheduler5EventENS2_5dequeIS7_NS2_9allocatorIS7_EESC_EEvRKNS_3PIDIT_EEMSE_FvT0_ET1_ENKUlPNS_11ProcessBaseEE_clESN_
> @0x101ce5a21  process::ProcessBase::visit()
> @0x101ce3747  process::ProcessManager::resume()
> @0x101d0e243  
> _ZNSt3__114__thread_proxyINS_5tupleIJNS_10unique_ptrINS_15__thread_structENS_14default_deleteIS3_ZN7process14ProcessManager12init_threadsEvE3$_0EPvSB_
> @ 0x7fffbb5d693b  _pthread_body
> @ 0x7fffbb5d6887  _pthread_start
> @ 0x7fffbb5d608d  thread_start
> [1]73521 abort  ./build/src/mesos-execute --master=127.0.0.1:5050 
> --task=task.json
> {noformat}
> Removing the resource role field allows the task to execute.





[jira] [Updated] (MESOS-7652) Docker image with universal containerizer does not work if WORKDIR is missing in the rootfs.

2017-07-31 Thread Alexander Rojas (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-7652?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rojas updated MESOS-7652:
---
Target Version/s: 1.2.3, 1.3.2, 1.4.0  (was: 1.2.2, 1.3.2, 1.4.0)

> Docker image with universal containerizer does not work if WORKDIR is missing 
> in the rootfs.
> 
>
> Key: MESOS-7652
> URL: https://issues.apache.org/jira/browse/MESOS-7652
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 1.2.1
>Reporter: michael beisiegel
>Assignee: Gilbert Song
>Priority: Critical
>  Labels: mesosphere
>
> hello,
> used the following docker image recently
> quay.io/spinnaker/front50:master
> https://quay.io/repository/spinnaker/front50
> Here the link to the Dockerfile
> https://github.com/spinnaker/front50/blob/master/Dockerfile
> and here the source
> {color:blue}FROM java:8
> MAINTAINER delivery-engineer...@netflix.com
> COPY . workdir/
> WORKDIR workdir
> RUN GRADLE_USER_HOME=cache ./gradlew buildDeb -x test && \
>   dpkg -i ./front50-web/build/distributions/*.deb && \
>   cd .. && \
>   rm -rf workdir
> CMD ["/opt/front50/bin/front50"]{color}
> The image works fine with the docker containerizer, but the universal 
> containerizer shows the following in stderr.
> "Failed to chdir into current working directory '/workdir': No such file or 
> directory"
> The problem comes from the fact that the Dockerfile creates a workdir, but a 
> later RUN step removes that directory. The docker containerizer has no 
> problem with this: if you do
> docker run -ti --rm quay.io/spinnaker/front50:master bash
> you get into the working dir, but the universal containerizer fails with the 
> error above.
> thanks for your help,
> Michael
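The failure mode described above can be illustrated with a toy sketch (plain Python, not Mesos code; the paths are hypothetical temp dirs standing in for a provisioned rootfs). The containerizer chdirs into the working directory recorded in the image config, which is absent from the final rootfs, whereas the report shows Docker copes by recreating the directory before starting the container:

```python
import os
import tempfile

rootfs = tempfile.mkdtemp()                  # stand-in for the provisioned image rootfs
workdir = os.path.join(rootfs, "workdir")    # the WORKDIR recorded in the image config

# The directory was created by WORKDIR but deleted by the later `rm -rf workdir`,
# so it is missing from the rootfs; chdir'ing into it fails, mirroring the
# containerizer error quoted above:
try:
    os.chdir(workdir)
except FileNotFoundError:
    print(f"Failed to chdir into current working directory '{workdir}'")

# Docker instead ensures the working directory exists before entering it:
os.makedirs(workdir, exist_ok=True)
os.chdir(workdir)
print("chdir ok")
```

The sketch only demonstrates the chdir semantics; the actual fix belongs in the containerizer's launch path.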





[jira] [Updated] (MESOS-7622) Agent can crash if a HTTP executor tries to retry subscription in running state.

2017-07-31 Thread Alexander Rojas (JIRA)

 [ https://issues.apache.org/jira/browse/MESOS-7622?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Alexander Rojas updated MESOS-7622:
---
Target Version/s: 1.2.3, 1.3.2  (was: 1.2.2, 1.3.2)

> Agent can crash if a HTTP executor tries to retry subscription in running 
> state.
> 
>
> Key: MESOS-7622
> URL: https://issues.apache.org/jira/browse/MESOS-7622
> Project: Mesos
>  Issue Type: Bug
>  Components: agent, executor
>Reporter: Aaron Wood
>Assignee: Anand Mazumdar
>Priority: Blocker
>
> It is possible that a running executor might retry its subscribe request. 
> This can lead to a crash if it previously had any launched tasks. Note that 
> the executor would still be able to subscribe again when the agent process 
> restarts and is recovering.
> {code}
> sudo ./mesos-agent --master=10.0.2.15:5050 --work_dir=/tmp/slave 
> --isolation=cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime
>  --image_providers=docker --image_provisioner_backend=overlay 
> --containerizers=mesos --launcher_dir=$(pwd) 
> --executor_environment_variables='{"LD_LIBRARY_PATH": 
> "/home/aaron/Code/src/mesos/build/src/.libs"}'
> WARNING: Logging before InitGoogleLogging() is written to STDERR
> I0605 14:58:23.748180 10710 main.cpp:323] Build: 2017-06-02 17:09:05 UTC by 
> aaron
> I0605 14:58:23.748252 10710 main.cpp:324] Version: 1.4.0
> I0605 14:58:23.755409 10710 systemd.cpp:238] systemd version `232` detected
> I0605 14:58:23.755450 10710 main.cpp:433] Initializing systemd state
> I0605 14:58:23.763049 10710 systemd.cpp:326] Started systemd slice 
> `mesos_executors.slice`
> I0605 14:58:23.763777 10710 resolver.cpp:69] Creating default secret resolver
> I0605 14:58:23.764214 10710 containerizer.cpp:230] Using isolation: 
> cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime,volume/image,environment_secret
> I0605 14:58:23.767192 10710 linux_launcher.cpp:150] Using 
> /sys/fs/cgroup/freezer as the freezer hierarchy for the Linux launcher
> E0605 14:58:23.770179 10710 shell.hpp:107] Command 'hadoop version 2>&1' 
> failed; this is the output:
> sh: 1: hadoop: not found
> I0605 14:58:23.770217 10710 fetcher.cpp:69] Skipping URI fetcher plugin 
> 'hadoop' as it could not be created: Failed to create HDFS client: Failed to 
> execute 'hadoop version 2>&1'; the command was either not found or exited 
> with a non-zero exit status: 127
> I0605 14:58:23.770643 10710 provisioner.cpp:255] Using default backend 
> 'overlay'
> I0605 14:58:23.785892 10710 slave.cpp:248] Mesos agent started on 
> (1)@127.0.1.1:5051
> I0605 14:58:23.785957 10710 slave.cpp:249] Flags at startup: 
> --appc_simple_discovery_uri_prefix="http://" 
> --appc_store_dir="/tmp/mesos/store/appc" --authenticate_http_readonly="false" 
> --authenticate_http_readwrite="false" --authenticatee="crammd5" 
> --authentication_backoff_factor="1secs" --authorizer="local" 
> --cgroups_cpu_enable_pids_and_tids_count="false" --cgroups_enable_cfs="false" 
> --cgroups_hierarchy="/sys/fs/cgroup" --cgroups_limit_swap="false" 
> --cgroups_root="mesos" --container_disk_watch_interval="15secs" 
> --containerizers="mesos" --default_role="*" --disk_watch_interval="1mins" 
> --docker="docker" --docker_kill_orphans="true" 
> --docker_registry="https://registry-1.docker.io" --docker_remove_delay="6hrs" 
> --docker_socket="/var/run/docker.sock" --docker_stop_timeout="0ns" 
> --docker_store_dir="/tmp/mesos/store/docker" 
> --docker_volume_checkpoint_dir="/var/run/mesos/isolators/docker/volume" 
> --enforce_container_disk_quota="false" 
> --executor_environment_variables="{"LD_LIBRARY_PATH":"\/home\/aaron\/Code\/src\/mesos\/build\/src\/.libs"}"
>  --executor_registration_timeout="1mins" 
> --executor_reregistration_timeout="2secs" 
> --executor_shutdown_grace_period="5secs" 
> --fetcher_cache_dir="/tmp/mesos/fetch" --fetcher_cache_size="2GB" 
> --frameworks_home="" --gc_delay="1weeks" --gc_disk_headroom="0.1" 
> --hadoop_home="" --help="false" --hostname_lookup="true" 
> --http_command_executor="false" --http_heartbeat_interval="30secs" 
> --image_providers="docker" --image_provisioner_backend="overlay" 
> --initialize_driver_logging="true" 
> --isolation="cgroups/cpu,cgroups/mem,disk/du,network/cni,filesystem/linux,docker/runtime"
>  --launcher="linux" --launcher_dir="/home/aaron/Code/src/mesos/build/src" 
> --logbufsecs="0" --logging_level="INFO" --master="10.0.2.15:5050" 
> --max_completed_executors_per_framework="150" 
> --oversubscribed_resources_interval="15secs" --perf_duration="10secs" 
> --perf_interval="1mins" --port="5051" --qos_correction_interval_min="0ns" 
> --quiet="false" --recover="reconnect" --recovery_timeout="15mins" 
> --registration_backoff_factor="1secs" --revocable_cpu_low_priority="true" 
> 

[jira] [Commented] (MESOS-7634) OsTest.ChownNoAccess fails on s390x machines

2017-07-31 Thread Nayana Thorat (JIRA)

 [ https://issues.apache.org/jira/browse/MESOS-7634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16107034#comment-16107034 ]

Nayana Thorat commented on MESOS-7634:
--

[~vinodkone] we were able to resolve the reported failure. Could you please 
verify?

> OsTest.ChownNoAccess fails on s390x machines
> 
>
> Key: MESOS-7634
> URL: https://issues.apache.org/jira/browse/MESOS-7634
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Nayana Thorat
>
> Running a custom branch of Mesos (with some fixes in docker build scripts for 
> s390x) on s390x-based CI machines throws the following error when running 
> stout tests.
> {code}
> [ RUN  ] OsTest.ChownNoAccess
> ../../../../3rdparty/stout/tests/os_tests.cpp:839: Failure
> Value of: os::chown(uid.get(), gid.get(), "one", true).isError()
>   Actual: false
> Expected: true
> ../../../../3rdparty/stout/tests/os_tests.cpp:840: Failure
> Value of: os::chown(uid.get(), gid.get(), "one/two", true).isError()
>   Actual: false
> {code}
> One can repro this by building Mesos from my custom branch here: 
> https://github.com/vinodkone/mesos/tree/vinod/s390x
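The failing assertions exercise stout's recursive os::chown: the test arranges for the chown to be impermissible (hence "NoAccess") and asserts that it errors, but on the s390x machines it unexpectedly succeeded. As a rough sketch of the semantics being tested, here is a simplified Python analogue of a recursive chown (not the actual C++ implementation, which walks the tree with fts and lchown); the demo below is a no-op chown to the caller's own uid/gid, which is always permitted:

```python
import os
import tempfile

def chown_recursive(uid, gid, path):
    """Chown `path` and everything beneath it, depth-first — a rough
    analogue of stout's os::chown(uid, gid, path, recursive=true)."""
    for dirpath, dirnames, filenames in os.walk(path, topdown=False):
        for name in filenames + dirnames:
            os.lchown(os.path.join(dirpath, name), uid, gid)
    os.lchown(path, uid, gid)

# Demo on a tree shaped like the test's "one/two", chowned to our own ids:
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, "one", "two"))
chown_recursive(os.getuid(), os.getgid(), root)
```

In the stout test the equivalent calls are expected to fail with a permission error when the caller cannot access the tree; the s390x failure is that they did not.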


