[jira] [Created] (MESOS-4898) Make sure modules can be built outside of mesos source tree during `make distcheck`

2016-03-08 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-4898:
-

 Summary: Make sure modules can be built outside of mesos source 
tree during `make distcheck`
 Key: MESOS-4898
 URL: https://issues.apache.org/jira/browse/MESOS-4898
 Project: Mesos
  Issue Type: Improvement
Reporter: Niklas Quarfot Nielsen


So far, verifying that mesos modules can be built outside of the mesos source 
tree has been left to manual testing. Issues occur when we reference non-public 
headers in module headers (isolators, allocator, etc.).

We should automate this, for example as part of the `make distcheck` step.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4595) Add support for newest pre-defined Perf events to PerfEventIsolator

2016-02-05 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4595?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15134027#comment-15134027
 ] 

Niklas Quarfot Nielsen commented on MESOS-4595:
---

The structure of `PerfStatistics` is nice but, as you mention, doesn't scale 
well with the massive number of available counters.
I like the idea of a Labels field with an encoding like you mention: 
"/hw_counters/XYZ", "/kernel_pmu/ZYX", etc.
Populating that field should probably be guarded by a flag on the perf 
isolator, so the resource statistics message doesn't explode in size if folks 
don't need all the information. 

> Add support for newest pre-defined Perf events to PerfEventIsolator
> ---
>
> Key: MESOS-4595
> URL: https://issues.apache.org/jira/browse/MESOS-4595
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Bartek Plotka
>Assignee: Bartek Plotka
>
> Currently, the Perf Event Isolator is able to monitor all Perf Events 
> (specified in {{--perf_events=...}}), but it can map only some of them into 
> {{ResourceUsage.proto}} (more precisely, into [PerfStatistics.proto | 
> https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L862]).
> Since the last time {{PerfStatistics.proto}} was updated, the list of 
> supported events has expanded considerably and is growing constantly. I have 
> created a comparison table:
> || Events type || Events matched in PerfStatistics || perf 4.3.3 events ||
> | HW events  | 8  | 8  |
> | SW events | 9 | 10 |
> | HW cache event | 20 | 20 |
> | *Kernel PMU events* | *0* | *37* |
> | Tracepoint events | 0 | billion (: |
> For advanced analysis (e.g. during Oversubscription in the QoS Controller), 
> support for additional events is crucial. For instance, in 
> [Serenity|https://github.com/mesosphere/serenity] we based some of our 
> revocation algorithms on the new [CMT| 
> https://01.org/packet-processing/cache-monitoring-technology-memory-bandwidth-monitoring-cache-allocation-technology-code-and-data]
>  feature, which provides an additional, useful event called {{llc_occupancy}}.
> I think we all agree that it would be great to support more (or even all) 
> perf events in {{Mesos PerfEventIsolator}} (:
> 
> Let's start a discussion over the approach. Within this task we have three 
> issues:
> # What events do we want to support in Mesos?
> ## all?
> ## only add Kernel PMU Events?
> ---
> I don't have a strong opinion on that, since I have never used {{Tracepoint 
> events}}. We currently need PMU events.
> # How to add new (or modify existing) events in {{mesos.proto}}?
> We can distinguish here 3 approaches:
> *# Add new events statically in {{PerfStatistics.proto}} as a separate 
> optional fields. (like it is currently)
> *# Instead of optional fields in {{PerfStatistics.proto}} message we could 
> have a {{key-value}} map (something like {{labels}} in other messages) and 
> feed it dynamically in {{PerfEventIsolator}}
> *# We could mix above approaches and just add mentioned map to existing 
> {{PerfStatistics.proto}} for additional events (:
> ---
> IMO: Approach 1 is the most explicit - users can see which events to expect 
> (although they are parsed in a different manner, e.g. {{"-"}} to {{"_"}}), but 
> we would end up with a very long message and a lot of copy-paste work. And we 
> have to maintain that!
> Approaches 2 & 3 are more flexible, and we don't have the problem mentioned in 
> the issue below (: And we *always* support *all* perf events in all kernel 
> versions (:
> IMO approaches 2 & 3 are the best.
> # How to support different naming format? For instance 
> {{intel_cqm/llc_occupancy/}} with {{"/"}} in name or  
> {{migrate:mm_migrate_pages}} with {{":"}}. I don't think it is possible to 
> have these as the field names in {{.proto}} syntax





[jira] [Updated] (MESOS-3361) Update MesosContainerizer to dynamically pick/enable isolators

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3361:
--
Shepherd:   (was: Niklas Quarfot Nielsen)

> Update MesosContainerizer to dynamically pick/enable isolators
> --
>
> Key: MESOS-3361
> URL: https://issues.apache.org/jira/browse/MESOS-3361
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> This would allow the frameworks to opt-in/opt-out of network isolation per 
> container. Thus, one can launch some containers with their own IPs while 
> other containers still share the host IP.





[jira] [Updated] (MESOS-3358) Add TaskStatus label decorator hooks for Master

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3358:
--
Shepherd:   (was: Niklas Quarfot Nielsen)

> Add TaskStatus label decorator hooks for Master
> ---
>
> Key: MESOS-3358
> URL: https://issues.apache.org/jira/browse/MESOS-3358
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> The hook will be triggered when Master receives TaskStatus message from Agent 
> or when the Master itself generates a TASK_LOST status. The hook should also 
> provide a list of the previous TaskStatuses to the module.
> The use case is to allow a "cleanup" module to release IPs if an agent is 
> lost. The previous statuses will contain the IP address(es) to be released.





[jira] [Updated] (MESOS-3362) Allow Isolators to advertise "capabilities" via SlaveInfo

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3362:
--
Shepherd:   (was: Niklas Quarfot Nielsen)

> Allow Isolators to advertise "capabilities" via SlaveInfo
> -
>
> Key: MESOS-3362
> URL: https://issues.apache.org/jira/browse/MESOS-3362
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> A network-isolator module can thus advertise that it can assign per-container 
> IPs and can provide network isolation.
> The SlaveInfo protobuf will be extended to include "Capabilities", similar to 
> FrameworkInfo::Capabilities.
> The isolator interface needs to be extended with an `info()` method that 
> returns an `IsolatorInfo` message. The `IsolatorInfo` message can include 
> "Capabilities" to be sent to Frameworks as part of SlaveInfo.
> The Isolator::info() interface will be used by the Slave during initialization 
> to compile SlaveInfo::Capabilities.





[jira] [Updated] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3740:
--
Shepherd:   (was: Niklas Quarfot Nielsen)

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for docker containers, when 
> the docker container has a libprocess process inside it (a mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set; otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).





[jira] [Updated] (MESOS-3585) Add a test module for ip-per-container support

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3585:
--
Shepherd:   (was: Niklas Quarfot Nielsen)

> Add a test module for ip-per-container support
> --
>
> Key: MESOS-3585
> URL: https://issues.apache.org/jira/browse/MESOS-3585
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> With the addition of {{NetworkInfo}} to allow frameworks to request 
> IP-per-container for their tasks, we should add a simple module that mimics 
> the behavior of a real network-isolation module for testing purposes. We can 
> then add this module in {{src/examples}} and write some tests against it.
> This module can also serve as a template module for third-party network 
> isolation providers building their own network isolator modules.





[jira] [Commented] (MESOS-2646) Update Master to send revocable resources in separate offers

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2646?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106566#comment-15106566
 ] 

Niklas Quarfot Nielsen commented on MESOS-2646:
---

[~JamesYongQiaoWang] Sorry about the delay. Do you still have capacity for some 
oversubscription work?

> Update Master to send revocable resources in separate offers
> 
>
> Key: MESOS-2646
> URL: https://issues.apache.org/jira/browse/MESOS-2646
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Vinod Kone
>Assignee: Yongqiao Wang
>  Labels: twitter
> Attachments: code-diff.txt
>
>
> Master will send separate offers for revocable and non-revocable/regular 
> resources. This allows master to rescind revocable offers (e.g, when a new 
> oversubscribed resources estimate comes from the slave) without impacting 
> regular offers.





[jira] [Commented] (MESOS-2688) Slave should kill revocable tasks if oversubscription is disabled

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2688?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106561#comment-15106561
 ] 

Niklas Quarfot Nielsen commented on MESOS-2688:
---

[~bmahler] - so, is that an OK for doing it on the slave? :)

> Slave should kill revocable tasks if oversubscription is disabled
> -
>
> Key: MESOS-2688
> URL: https://issues.apache.org/jira/browse/MESOS-2688
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: Jie Yu
>  Labels: twitter
>
> If oversubscription is disabled on a restarted slave (that had it previously 
> enabled), it should kill revocable tasks.
> Slave knows this information from the Resources of a container that it 
> checkpoints and recovers.
> Add a new reason OVERSUBSCRIPTION_DISABLED.





[jira] [Commented] (MESOS-2695) Add master flag to enable/disable oversubscription

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106565#comment-15106565
 ] 

Niklas Quarfot Nielsen commented on MESOS-2695:
---

[~vi...@twitter.com] should we mark as 'won't fix' for now?

> Add master flag to enable/disable oversubscription
> --
>
> Key: MESOS-2695
> URL: https://issues.apache.org/jira/browse/MESOS-2695
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>  Labels: twitter
>
> This flag lets an operator control cluster level oversubscription. 
> The master should send revocable offers to framework if this flag is enabled 
> and the framework opts in to receive them.
> Master should ignore revocable resources from slaves if the flag is disabled.
> Need tests for all these scenarios.





[jira] [Updated] (MESOS-4429) Add oversubscription benchmark/stress/test framework

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4429?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-4429:
--
Description: 
To evaluate the function and quality of oversubscription modules, we could ship 
a test framework which can:
1) Launch on oversubscribed and non-oversubscribed resources in a controlled 
manner. For example, register as two different frameworks and verify that the 
slack resources of one framework can be used by the other.

2) Measure the time to react in different scenarios. For example, measure the 
time from slack appearing on a slave to an offer with revocable resources being 
issued, and the time to react to changing usage patterns, e.g. the time to 
reclaim oversubscribed resources when regular tasks need them back.

3) Count the number of offer rescind, preemptions, etc. to deem the stability 
of the policy.

4) Measure the percentage of extra work that can be run.

5) Work across different resource dimensions such as CPU time, memory, 
network, and caches.

[~Bartek Plotka] has been working on something similar for Serenity in 
https://github.com/mesosphere/serenity/tree/master/src/framework which we can 
reuse as a base.

  was:
To evaluate the function and quality of oversubscription modules, we could ship 
a test framework which can:
1) Launch on oversubscribed and non-oversubscribed resources in a controlled 
manner. For example, register as two different frameworks and see that 
resources from slack resources of one framework can be used by the other.
2) Measure time to react for different scenarios. For example, measure the time 
it takes from slack appearing on a slave to the offer being issued with 
revocable resources. The time to react for changing usage patterns e.g. time to 
reclaim oversubscribed resources when regular tasks need them back.
3) Count the number of offer rescind, preemptions, etc. to deem the stability 
of the policy.
4) Be able to measure % extra work being able to run.
5) Work across different resource dimensions as cpu time, memory, network, 
caches.

[~Bartek Plotka] has been working on something similar for Serenity in 
https://github.com/mesosphere/serenity/tree/master/src/framework which we can 
reuse as a base.


> Add oversubscription benchmark/stress/test framework
> 
>
> Key: MESOS-4429
> URL: https://issues.apache.org/jira/browse/MESOS-4429
> Project: Mesos
>  Issue Type: Task
>Reporter: Niklas Quarfot Nielsen
>
> To evaluate the function and quality of oversubscription modules, we could 
> ship a test framework which can:
> 1) Launch on oversubscribed and non-oversubscribed resources in a controlled 
> manner. For example, register as two different frameworks and verify that the 
> slack resources of one framework can be used by the other.
> 2) Measure time to react for different scenarios. For example, measure the 
> time it takes from slack appearing on a slave to the offer being issued with 
> revocable resources. The time to react for changing usage patterns e.g. time 
> to reclaim oversubscribed resources when regular tasks need them back.
> 3) Count the number of offer rescind, preemptions, etc. to deem the stability 
> of the policy.
> 4) Measure the percentage of extra work that can be run.
> 5) Work across different resource dimensions such as CPU time, memory, 
> network, and caches.
> [~Bartek Plotka] has been working on something similar for Serenity in 
> https://github.com/mesosphere/serenity/tree/master/src/framework which we can 
> reuse as a base.





[jira] [Commented] (MESOS-2930) Allow the Resource Estimator to express over-allocation of revocable resources.

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106688#comment-15106688
 ] 

Niklas Quarfot Nielsen commented on MESOS-2930:
---

Hi [~bmahler] - sorry for the super tardy reply.

For Serenity, the Estimator and QoS Controller act as edges on a shared 
pipeline of filters (which lives in its own actor). In short, the estimator 
pushes usage statistics in and awaits estimates, whereas the QoS controller 
awaits corrections from the pipeline.

> Allow the Resource Estimator to express over-allocation of revocable 
> resources.
> ---
>
> Key: MESOS-2930
> URL: https://issues.apache.org/jira/browse/MESOS-2930
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Klaus Ma
>
> Currently the resource estimator returns the amount of oversubscription 
> resources that are available. Since resources cannot be negative, this allows 
> the resource estimator to express only the following:
> (1) Return empty resources: We are fully allocated for oversubscription 
> resources.
> (2) Return non-empty resources: We are under-allocated for oversubscription 
> resources. In other words, some are available.
> However, there is an additional situation that we cannot express:
> (3) Analogous to returning non-empty "negative" resources: We are 
> over-allocated for oversubscription resources. Do not re-offer any of the 
> over-allocated oversubscription resources that are recovered.
> Without (3), the slave can only shrink the total pool of oversubscription 
> resources by returning (1) as resources are recovered, until the pool is 
> shrunk to the desired size. However, this approach is only best-effort; it's 
> possible for a framework to launch more tasks in the window of time (15 
> seconds by default) between the slave's polls of the estimator.





[jira] [Commented] (MESOS-3889) Modify Oversubscription documentation to explicitly forbid the QoS Controller from killing executors running on optimistically offered resources.

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3889?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15106748#comment-15106748
 ] 

Niklas Quarfot Nielsen commented on MESOS-3889:
---

[~hartem] [~klaus1982] Can you add a bit of context on this ticket? :)

> Modify Oversubscription documentation to explicitly forbid the QoS Controller 
> from killing executors running on optimistically offered resources.
> -
>
> Key: MESOS-3889
> URL: https://issues.apache.org/jira/browse/MESOS-3889
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Klaus Ma
>  Labels: mesosphere
>






[jira] [Created] (MESOS-4429) Add oversubscription benchmark/stress/test framework

2016-01-19 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-4429:
-

 Summary: Add oversubscription benchmark/stress/test framework
 Key: MESOS-4429
 URL: https://issues.apache.org/jira/browse/MESOS-4429
 Project: Mesos
  Issue Type: Task
Reporter: Niklas Quarfot Nielsen


To evaluate the function and quality of oversubscription modules, we could ship 
a test framework which can:
1) Launch on oversubscribed and non-oversubscribed resources in a controlled 
manner. For example, register as two different frameworks and verify that the 
slack resources of one framework can be used by the other.
2) Measure time to react for different scenarios. For example, measure the time 
it takes from slack appearing on a slave to the offer being issued with 
revocable resources. The time to react for changing usage patterns e.g. time to 
reclaim oversubscribed resources when regular tasks need them back.
3) Count the number of offer rescind, preemptions, etc. to deem the stability 
of the policy.
4) Measure the percentage of extra work that can be run.
5) Work across different resource dimensions such as CPU time, memory, 
network, and caches.

[~Bartek Plotka] has been working on something similar for Serenity in 
https://github.com/mesosphere/serenity/tree/master/src/framework which we can 
reuse as a base.





[jira] [Commented] (MESOS-314) Support the cgroups 'cpusets' subsystem.

2016-01-11 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091904#comment-15091904
 ] 

Niklas Quarfot Nielsen commented on MESOS-314:
--

Also, I have a vague memory of us having an existing ticket for cpu set 
isolators. [~jvanremoortere] do you recall which ticket it is? I haven't been 
able to dig it up. In terms of the oversubscription work going on in Serenity, 
it would be great to get some notion of core pinning / affinity in Mesos :)

> Support the cgroups 'cpusets' subsystem.
> 
>
> Key: MESOS-314
> URL: https://issues.apache.org/jira/browse/MESOS-314
> Project: Mesos
>  Issue Type: Story
>Reporter: Benjamin Mahler
>  Labels: twitter
>
> We'd like to add support for the cpusets subsystem, in order to support 
> pinning to cpus.
> This has several potential benefits:
> 1. Improved isolation against other tenants, when given exclusive access to 
> cores.
> 2. Improved performance, if pinned to several cores with good locality in the 
> CPU topology.
> 3. An alternative / complement to CFS for applying an upper limit on CPU 
> usage.





[jira] [Commented] (MESOS-314) Support the cgroups 'cpusets' subsystem.

2016-01-11 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091873#comment-15091873
 ] 

Niklas Quarfot Nielsen commented on MESOS-314:
--

[~tstclair] Reg. the cpuset bug in groups. Do you know if this is still the 
case?

> Support the cgroups 'cpusets' subsystem.
> 
>
> Key: MESOS-314
> URL: https://issues.apache.org/jira/browse/MESOS-314
> Project: Mesos
>  Issue Type: Story
>Reporter: Benjamin Mahler
>  Labels: twitter
>
> We'd like to add support for the cpusets subsystem, in order to support 
> pinning to cpus.
> This has several potential benefits:
> 1. Improved isolation against other tenants, when given exclusive access to 
> cores.
> 2. Improved performance, if pinned to several cores with good locality in the 
> CPU topology.
> 3. An alternative / complement to CFS for applying an upper limit on CPU 
> usage.





[jira] [Comment Edited] (MESOS-314) Support the cgroups 'cpusets' subsystem.

2016-01-11 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-314?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091873#comment-15091873
 ] 

Niklas Quarfot Nielsen edited comment on MESOS-314 at 1/11/16 1:07 PM:
---

[~tstclair] Reg. the cpuset bug in cgroups. Do you know if this is still the 
case?


was (Author: nnielsen):
[~tstclair] Reg. the cpuset bug in groups. Do you know if this is still the 
case?

> Support the cgroups 'cpusets' subsystem.
> 
>
> Key: MESOS-314
> URL: https://issues.apache.org/jira/browse/MESOS-314
> Project: Mesos
>  Issue Type: Story
>Reporter: Benjamin Mahler
>  Labels: twitter
>
> We'd like to add support for the cpusets subsystem, in order to support 
> pinning to cpus.
> This has several potential benefits:
> 1. Improved isolation against other tenants, when given exclusive access to 
> cores.
> 2. Improved performance, if pinned to several cores with good locality in the 
> CPU topology.
> 3. An alternative / complement to CFS for applying an upper limit on CPU 
> usage.





[jira] [Commented] (MESOS-3358) Add TaskStatus label decorator hooks for Master

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042276#comment-15042276
 ] 

Niklas Quarfot Nielsen commented on MESOS-3358:
---

[~karya] Ping ^^. Is work on this coming up? If not, we should probably 
unassign ourselves from the ticket until we have a path forward.

> Add TaskStatus label decorator hooks for Master
> ---
>
> Key: MESOS-3358
> URL: https://issues.apache.org/jira/browse/MESOS-3358
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> The hook will be triggered when Master receives TaskStatus message from Agent 
> or when the Master itself generates a TASK_LOST status. The hook should also 
> provide a list of the previous TaskStatuses to the module.
> The use case is to allow a "cleanup" module to release IPs if an agent is 
> lost. The previous statuses will contain the IP address(es) to be released.





[jira] [Commented] (MESOS-3362) Allow Isolators to advertise "capabilities" via SlaveInfo

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3362?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042266#comment-15042266
 ] 

Niklas Quarfot Nielsen commented on MESOS-3362:
---

[~vinodkone] They probably overlap. I think these were higher-level 
capabilities: one isolator (say, sitting within a module with a user/vendor 
selected name) may provide one or more capabilities.
Maybe MESOS-2221 is enough for now. [~karya] what do you think?

> Allow Isolators to advertise "capabilities" via SlaveInfo
> -
>
> Key: MESOS-3362
> URL: https://issues.apache.org/jira/browse/MESOS-3362
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> A network-isolator module can thus advertise that it can assign per-container 
> IPs and can provide network isolation.
> The SlaveInfo protobuf will be extended to include "Capabilities", similar to 
> FrameworkInfo::Capabilities.
> The isolator interface needs to be extended with an `info()` method that 
> returns an `IsolatorInfo` message. The `IsolatorInfo` message can include 
> "Capabilities" to be sent to Frameworks as part of SlaveInfo.
> The Isolator::info() interface will be used by the Slave during initialization 
> to compile SlaveInfo::Capabilities.





[jira] [Commented] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042259#comment-15042259
 ] 

Niklas Quarfot Nielsen commented on MESOS-3740:
---

[~tnachen] Sorry for the radio silence. Do you have capacity to take this on 
or shepherd it?

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for docker containers, when 
> the docker container has a libprocess process inside it (a mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set; otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).





[jira] [Commented] (MESOS-3485) Make hook execution order deterministic

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042262#comment-15042262
 ] 

Niklas Quarfot Nielsen commented on MESOS-3485:
---

Awesome! I think we can land your patch if we can figure out a way to test it :)

> Make hook execution order deterministic
> ---
>
> Key: MESOS-3485
> URL: https://issues.apache.org/jira/browse/MESOS-3485
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Felix Abecassis
>Assignee: haosdent
>
> Currently, when using multiple hooks of the same type, the execution order is 
> implementation-defined. 
> This is because in src/hook/manager.cpp, the list of available hooks is 
> stored in a {{hashmap}}. A hashmap is probably unnecessary for 
> this task since the number of hooks should remain reasonable. A data 
> structure preserving ordering should be used instead to allow the user to 
> predict the execution order of the hooks. I suggest that the execution order 
> should be the order in which hooks are specified with {{--hooks}} when 
> starting an agent/master.
> This will be useful when combining multiple hooks after MESOS-3366 is done.





[jira] [Commented] (MESOS-3361) Update MesosContainerizer to dynamically pick/enable isolators

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042272#comment-15042272
 ] 

Niklas Quarfot Nielsen commented on MESOS-3361:
---

[~karya] Do you still want to work on this?

> Update MesosContainerizer to dynamically pick/enable isolators
> --
>
> Key: MESOS-3361
> URL: https://issues.apache.org/jira/browse/MESOS-3361
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> This would allow the frameworks to opt-in/opt-out of network isolation per 
> container. Thus, one can launch some containers with their own IPs while 
> other containers still share the host IP.





[jira] [Commented] (MESOS-3585) Add a test module for ip-per-container support

2015-12-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3585?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15042425#comment-15042425
 ] 

Niklas Quarfot Nielsen commented on MESOS-3585:
---

Hi [~karya]; have you started work on this?

> Add a test module for ip-per-container support
> --
>
> Key: MESOS-3585
> URL: https://issues.apache.org/jira/browse/MESOS-3585
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> With the addition of {{NetworkInfo}} to allow frameworks to request 
> IP-per-container for their tasks, we should add a simple module that mimics 
> the behavior of a real network-isolation module for testing purposes. We can 
> then add this module in {{src/examples}} and write some tests against it.
> This module can also serve as a template module for third-party network 
> isolation providers building their own network isolator modules.





[jira] [Created] (MESOS-3840) Build broken: 'adding 'bool' to a string does not append to the string' in filesystem tests

2015-11-06 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-3840:
-

 Summary: Build broken: 'adding 'bool' to a string does not append 
to the string' in filesystem tests
 Key: MESOS-3840
 URL: https://issues.apache.org/jira/browse/MESOS-3840
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
Priority: Blocker


{code}
../../src/tests/containerizer/filesystem_isolator_tests.cpp:125:52: error: 
adding 'bool' to a string does not append to the string 
[-Werror,-Wstring-plus-int]
return Error("Failed to create root dir: " + mkdir.isError());
 ~~^
../../src/tests/containerizer/filesystem_isolator_tests.cpp:125:52: note: use 
array indexing to silence this warning
return Error("Failed to create root dir: " + mkdir.isError());
   ^
 & []
{code}

{code}
 124   if (mkdir.isError()) {
 125 return Error("Failed to create root dir: " + mkdir.isError());
 126   }
{code}

should be

{code}
 124   if (mkdir.isError()) {
 125 return Error("Failed to create root dir: " + mkdir.error());
 126   }
{code}
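The warning exists because `"literal" + bool` is pointer arithmetic on the {{const char*}} the literal decays to, not concatenation, so a true {{isError()}} silently chops the first character off the message. A minimal self-contained illustration (the {{FakeTry}} type below is a made-up stand-in for stout's {{Try}}, and the buggy line compiles only when {{-Wstring-plus-int}} is not treated as an error):

```cpp
#include <string>

// Hypothetical stand-in for stout's Try: isError() returns bool,
// error() returns the error message string.
struct FakeTry
{
  bool isError() const { return true; }
  std::string error() const { return "permission denied"; }
};

// Buggy: 'const char* + bool' offsets the pointer by 1 when isError() is
// true, dropping the leading 'F' instead of appending anything.
std::string buggy(const FakeTry& t)
{
  return "Failed: " + t.isError();
}

// Fixed: 'const char* + std::string' concatenates as intended.
std::string fixed(const FakeTry& t)
{
  return "Failed: " + t.error();
}
```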





[jira] [Commented] (MESOS-3766) Can not kill task in Status STAGING

2015-10-23 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14971789#comment-14971789
 ] 

Niklas Quarfot Nielsen commented on MESOS-3766:
---

Thanks [~anandmazumdar]!

[~matth...@mesosphere.io] - I haven't been able to repro yet. How many slaves 
were you running? Is it mesos-local? Can you repro easily (and maybe enable 
verbose logging)?

[~anandmazumdar] - do you have time to take this one on?

> Can not kill task in Status STAGING
> ---
>
> Key: MESOS-3766
> URL: https://issues.apache.org/jira/browse/MESOS-3766
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.25.0
> Environment: OSX 
>Reporter: Matthias Veit
>Assignee: Niklas Quarfot Nielsen
> Attachments: master.log.zip, slave.log.zip
>
>
> I have created a simple Marathon Application with instance count 100 (100 
> tasks) with a simple sleep command. Before all tasks were running, I killed 
> all tasks. This operation was successful except for 2 tasks. These 2 tasks are 
> in state STAGING (according to the mesos UI). Marathon tries to kill those 
> tasks every 5 seconds (for over an hour now) - unsuccessfully.
> I picked one task and grepped the slave log:
> {noformat}
> I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour
> I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80
> I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container 
> '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr
> I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing 
> executor's forked pid 37096 to 
> '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks
> I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000
> I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame
> I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task 
> 

[jira] [Commented] (MESOS-3775) MasterAllocatorTest.SlaveLost is slow

2015-10-22 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14969952#comment-14969952
 ] 

Niklas Quarfot Nielsen commented on MESOS-3775:
---

[~marco-mesos] Can you help schedule this during the next sprint?

> MasterAllocatorTest.SlaveLost is slow
> -
>
> Key: MESOS-3775
> URL: https://issues.apache.org/jira/browse/MESOS-3775
> Project: Mesos
>  Issue Type: Bug
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Priority: Minor
>  Labels: mesosphere
>
> The {{MasterAllocatorTest.SlaveLost}} takes more than {{5s}} to complete. A 
> brief look into the code hints that the stopped agent does not quit 
> immediately (and hence its resources are not released by the allocator) 
> because [it waits for the executor to 
> terminate|https://github.com/apache/mesos/blob/master/src/tests/master_allocator_tests.cpp#L717].
>  {{5s}} timeout comes from {{EXECUTOR_SHUTDOWN_GRACE_PERIOD}} agent constant.
> Possible solutions:
> * Do not wait until the stopped agent quits (can be flaky, needs deeper 
> analysis).
> * Decrease the agent's {{executor_shutdown_grace_period}} flag.
> * Terminate the executor faster (this may require some refactoring, since the 
> executor driver is created in the {{TestContainerizer}} and we do not have 
> direct access to it).
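The second option above — shrinking the agent's grace period in the test — can be sketched as follows. This is a hypothetical, simplified stand-in for the real agent flags struct; the field name and the 5s default merely mirror what the ticket describes.

```cpp
// Hypothetical sketch of overriding the shutdown grace period for a test,
// so teardown does not block for the full (assumed 5s) default.
struct SlaveFlags
{
  // Grace period in milliseconds; default mirrors the agent's
  // EXECUTOR_SHUTDOWN_GRACE_PERIOD constant described above.
  int executor_shutdown_grace_period_ms = 5000;
};

// Returns flags tuned for fast test teardown.
SlaveFlags fastShutdownFlags()
{
  SlaveFlags flags;
  flags.executor_shutdown_grace_period_ms = 100; // well under the 5s default
  return flags;
}
```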





[jira] [Updated] (MESOS-3775) MasterAllocatorTest.SlaveLost is slow

2015-10-22 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3775?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3775:
--
Labels: mesosphere  (was: )

> MasterAllocatorTest.SlaveLost is slow
> -
>
> Key: MESOS-3775
> URL: https://issues.apache.org/jira/browse/MESOS-3775
> Project: Mesos
>  Issue Type: Bug
>  Components: technical debt, test
>Reporter: Alexander Rukletsov
>Priority: Minor
>  Labels: mesosphere
>
> The {{MasterAllocatorTest.SlaveLost}} takes more than {{5s}} to complete. A 
> brief look into the code hints that the stopped agent does not quit 
> immediately (and hence its resources are not released by the allocator) 
> because [it waits for the executor to 
> terminate|https://github.com/apache/mesos/blob/master/src/tests/master_allocator_tests.cpp#L717].
>  {{5s}} timeout comes from {{EXECUTOR_SHUTDOWN_GRACE_PERIOD}} agent constant.
> Possible solutions:
> * Do not wait until the stopped agent quits (can be flaky, needs deeper 
> analysis).
> * Decrease the agent's {{executor_shutdown_grace_period}} flag.
> * Terminate the executor faster (this may require some refactoring, since the 
> executor driver is created in the {{TestContainerizer}} and we do not have 
> direct access to it).





[jira] [Commented] (MESOS-3766) Can not kill task in Status STAGING

2015-10-20 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965474#comment-14965474
 ] 

Niklas Quarfot Nielsen commented on MESOS-3766:
---

[~matth...@mesosphere.io] acknowledged; will take a look.

Can you share the full logs in the meantime? Any details that precede the 
stuck state would help. 

> Can not kill task in Status STAGING
> ---
>
> Key: MESOS-3766
> URL: https://issues.apache.org/jira/browse/MESOS-3766
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.25.0
> Environment: OSX 
>Reporter: Matthias Veit
>Assignee: Niklas Quarfot Nielsen
>
> I have created a simple Marathon Application with instance count 100 (100 
> tasks) with a simple sleep command. Before all tasks were running, I killed 
> all tasks. This operation was successful except for 2 tasks. These 2 tasks are 
> in state STAGING (according to the mesos UI). Marathon tries to kill those 
> tasks every 5 seconds (for over an hour now) - unsuccessfully.
> I picked one task and grepped the slave log:
> {noformat}
> I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour
> I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80
> I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container 
> '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr
> I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing 
> executor's forked pid 37096 to 
> '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks
> I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000
> I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame
> I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:18.316463 316018688 slave.cpp:1789] Asked to kill task 

[jira] [Commented] (MESOS-3766) Can not kill task in Status STAGING

2015-10-20 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3766?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14965665#comment-14965665
 ] 

Niklas Quarfot Nielsen commented on MESOS-3766:
---

[~matth...@mesosphere.io] Also, can you grab the master and slave state 
endpoint data?

> Can not kill task in Status STAGING
> ---
>
> Key: MESOS-3766
> URL: https://issues.apache.org/jira/browse/MESOS-3766
> Project: Mesos
>  Issue Type: Bug
>  Components: general
>Affects Versions: 0.25.0
> Environment: OSX 
>Reporter: Matthias Veit
>Assignee: Niklas Quarfot Nielsen
>
> I have created a simple Marathon Application with instance count 100 (100 
> tasks) with a simple sleep command. Before all tasks were running, I killed 
> all tasks. This operation was successful except for 2 tasks. These 2 tasks are 
> in state STAGING (according to the mesos UI). Marathon tries to kill those 
> tasks every 5 seconds (for over an hour now) - unsuccessfully.
> I picked one task and grepped the slave log:
> {noformat}
> I1020 12:39:38.480478 315482112 slave.cpp:1270] Got assigned task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.887559 315482112 slave.cpp:1386] Launching task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d for framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:39:38.898221 315482112 slave.cpp:4852] Launching executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8- with resour
> I1020 12:39:38.899521 315482112 slave.cpp:1604] Queuing task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' for executor 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework '80
> I1020 12:39:39.740401 313872384 containerizer.cpp:640] Starting container 
> '5ce75a17-12db-4c8f-9131-b40f8280b9f7' for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of fr
> I1020 12:39:40.495931 313872384 containerizer.cpp:873] Checkpointing 
> executor's forked pid 37096 to 
> '/tmp/mesos/meta/slaves/80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-S0/frameworks
> I1020 12:39:41.744439 313335808 slave.cpp:2379] Got registration for executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-000
> I1020 12:39:42.080734 313335808 slave.cpp:1760] Sending queued task 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' to executor 
> 'app.dc98434b-7716-11e5-a5fc-1ea69edef42d' of frame
> I1020 12:40:13.073390 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:18.079651 312262656 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:23.097504 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:28.118443 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:33.138137 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:38.158529 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:43.177901 314408960 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:48.197852 313872384 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:53.216672 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:40:58.238471 314945536 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:03.256614 312799232 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:08.276450 313335808 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:13.297114 315482112 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 80ba2050-bf0f-4472-a2f7-2636c4f7b8c8-
> I1020 12:41:18.316463 316018688 slave.cpp:1789] Asked to kill task 
> app.dc98434b-7716-11e5-a5fc-1ea69edef42d of framework 
> 

[jira] [Updated] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-10-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3740:
--
Shepherd: Niklas Quarfot Nielsen  (was: Michael Park)

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>Assignee: Niklas Quarfot Nielsen
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for Docker containers, when 
> the Docker container has a libprocess process inside it (a Mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set, otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).
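The forwarding the ticket asks for can be sketched as below: copy {{LIBPROCESS_IP}} from the agent's environment into the environment handed to the container (for Docker, this would end up as a {{docker run -e LIBPROCESS_IP=...}} argument). The {{taskEnvironment}} helper is hypothetical, not the actual containerizer.cpp API.

```cpp
#include <map>
#include <string>

// Hypothetical sketch: forward LIBPROCESS_IP (the agent's advertised IP)
// into a --net=host Docker task's environment, so a libprocess-based
// scheduler inside the container does not have to guess the host IP.
std::map<std::string, std::string> taskEnvironment(
    const std::map<std::string, std::string>& agentEnvironment)
{
  std::map<std::string, std::string> env;

  auto it = agentEnvironment.find("LIBPROCESS_IP");
  if (it != agentEnvironment.end()) {
    // Would be passed to the container, e.g. 'docker run -e LIBPROCESS_IP=...'.
    env["LIBPROCESS_IP"] = it->second;
  }

  return env;
}
```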





[jira] [Updated] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-10-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3740:
--
Assignee: Michael Park  (was: Niklas Quarfot Nielsen)

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>Assignee: Michael Park
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for Docker containers, when 
> the Docker container has a libprocess process inside it (a Mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set, otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).





[jira] [Updated] (MESOS-3752) CentOS 6 dependency install fails at Maven

2015-10-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3752?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3752:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> CentOS 6 dependency install fails at Maven
> --
>
> Key: MESOS-3752
> URL: https://issues.apache.org/jira/browse/MESOS-3752
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: documentation, installation, mesosphere
>
> It seems the Apache Maven dependencies have changed such that following the 
> Getting Started docs for CentOS 6.6 will fail at Maven installation:
> {code}
> ---> Package apache-maven.noarch 0:3.3.3-2.el6 will be installed
> --> Processing Dependency: java-devel >= 1:1.7.0 for package: 
> apache-maven-3.3.3-2.el6.noarch
> --> Finished Dependency Resolution
> Error: Package: apache-maven-3.3.3-2.el6.noarch (epel-apache-maven)
>Requires: java-devel >= 1:1.7.0
>Available: java-1.5.0-gcj-devel-1.5.0.0-29.1.el6.x86_64 (base)
>java-devel = 1.5.0
>Available: 
> 1:java-1.6.0-openjdk-devel-1.6.0.35-1.13.7.1.el6_6.x86_64 (base)
>java-devel = 1:1.6.0
>Available: 
> 1:java-1.6.0-openjdk-devel-1.6.0.36-1.13.8.1.el6_7.x86_64 (updates)
>java-devel = 1:1.6.0
>  You could try using --skip-broken to work around the problem
>  You could try running: rpm -Va --nofiles --nodigest
> {code}





[jira] [Updated] (MESOS-3736) Support docker local store pull same image simultaneously

2015-10-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3736:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> Support docker local store pull same image simultaneously 
> --
>
> Key: MESOS-3736
> URL: https://issues.apache.org/jira/browse/MESOS-3736
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>
> The current local store implements get() using the local puller. When several 
> requests pull the same Docker image at the same time, the local puller untars 
> the image tarball once per request and copies each result into the same 
> directory, which wastes time and computation. The local store/puller should do 
> this work only for the first request; simultaneous pull requests should wait 
> on the promised future and get the result once the first pull finishes. 
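The deduplication described above can be sketched with a future shared per image name: the first get() starts the pull, later get()s for the same image reuse the in-flight future. This is a hypothetical, simplified {{LocalStore}}, not the Mesos provisioner code (which uses libprocess futures rather than std::future).

```cpp
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical sketch: only the first get() for an image runs the expensive
// pull; concurrent get()s for the same image share the same future.
class LocalStore
{
public:
  explicit LocalStore(std::function<std::string(const std::string&)> pull)
    : pull_(std::move(pull)) {}

  std::shared_future<std::string> get(const std::string& image)
  {
    std::lock_guard<std::mutex> lock(mutex_);

    auto it = pulling_.find(image);
    if (it != pulling_.end()) {
      return it->second; // Reuse the in-flight (or completed) pull.
    }

    // First request: launch the pull and remember its future.
    std::shared_future<std::string> future =
      std::async(std::launch::deferred, pull_, image).share();
    pulling_.emplace(image, future);
    return future;
  }

private:
  std::function<std::string(const std::string&)> pull_;
  std::mutex mutex_;
  std::map<std::string, std::shared_future<std::string>> pulling_;
};
```

With this shape, N simultaneous requests for the same image cost one untar instead of N.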





[jira] [Updated] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-10-14 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3740:
--
Shepherd: Michael Park

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>Assignee: Niklas Quarfot Nielsen
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for Docker containers, when 
> the Docker container has a libprocess process inside it (a Mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set, otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).





[jira] [Assigned] (MESOS-3740) LIBPROCESS_IP not passed to Docker containers

2015-10-14 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3740?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen reassigned MESOS-3740:
-

Assignee: Niklas Quarfot Nielsen

> LIBPROCESS_IP not passed to Docker containers
> -
>
> Key: MESOS-3740
> URL: https://issues.apache.org/jira/browse/MESOS-3740
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization, docker
>Affects Versions: 0.25.0
> Environment: Mesos 0.24.1
>Reporter: Cody Maloney
>Assignee: Niklas Quarfot Nielsen
>  Labels: mesosphere
>
> Docker containers aren't currently passed all the same environment variables 
> that Mesos Containerizer tasks are. See: 
> https://github.com/apache/mesos/blob/master/src/slave/containerizer/containerizer.cpp#L254
>  for all the environment variables explicitly set for mesos containers.
> While some of them don't necessarily make sense for Docker containers, when 
> the Docker container has a libprocess process inside it (a Mesos framework 
> scheduler) and is using {{--net=host}}, the task needs to have LIBPROCESS_IP 
> set, otherwise the same sort of problems that happen because of MESOS-3553 can 
> happen (libprocess will try to guess the machine's IP address, with likely bad 
> results in a number of operating environments).





[jira] [Updated] (MESOS-3485) Make hook execution order deterministic

2015-10-12 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3485:
--
Story Points: 3

> Make hook execution order deterministic
> ---
>
> Key: MESOS-3485
> URL: https://issues.apache.org/jira/browse/MESOS-3485
> Project: Mesos
>  Issue Type: Improvement
>  Components: modules
>Reporter: Felix Abecassis
>Assignee: haosdent
>
> Currently, when using multiple hooks of the same type, the execution order is 
> implementation-defined. 
> This is because in src/hook/manager.cpp, the list of available hooks is 
> stored in a {{hashmap}}. A hashmap is probably unnecessary for 
> this task since the number of hooks should remain reasonable. A data 
> structure preserving ordering should be used instead to allow the user to 
> predict the execution order of the hooks. I suggest that the execution order 
> should be the order in which hooks are specified with {{--hooks}} when 
> starting an agent/master.
> This will be useful when combining multiple hooks after MESOS-3366 is done.





[jira] [Updated] (MESOS-3700) Deprecate resource_monitoring_interval flag

2015-10-12 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3700:
--
Story Points: 1

> Deprecate resource_monitoring_interval flag
> ---
>
> Key: MESOS-3700
> URL: https://issues.apache.org/jira/browse/MESOS-3700
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
> Fix For: 0.26.0
>
>
> This parameter should be deprecated after the 0.23.0 release, as it is no 
> longer used. 





[jira] [Updated] (MESOS-3366) Allow resources/attributes discovery

2015-10-12 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3366:
--
Story Points: 3

> Allow resources/attributes discovery
> 
>
> Key: MESOS-3366
> URL: https://issues.apache.org/jira/browse/MESOS-3366
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Felix Abecassis
>
> In heterogeneous clusters, tasks sometimes have strong constraints on the 
> type of hardware they need to execute on. The current solution is to use 
> custom resources and attributes on the agents. Detecting non-standard 
> resources/attributes requires wrapping the "mesos-slave" binary behind a 
> script and using custom code to probe the agent. Unfortunately, this approach 
> doesn't allow composition. The solution would be to provide a hook/module 
> mechanism to allow users to use custom code performing resources/attributes 
> discovery.
> Please review the detailed document below:
> https://docs.google.com/document/d/15OkebDezFxzeyLsyQoU0upB0eoVECAlzEkeg0HQAX9w
> Feel free to express comments/concerns by annotating the document or by 
> replying to this issue.
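The composition the ticket asks for can be sketched as below: each discovery module contributes attributes, and the agent merges all contributions, rather than a single monolithic wrapper script around "mesos-slave". The {{Discoverer}}/{{discoverAttributes}} names are hypothetical, not part of the proposal's actual API.

```cpp
#include <functional>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: each discovery module returns a set of attributes,
// and the agent composes them by merging all the maps.
using Discoverer = std::function<std::map<std::string, std::string>()>;

std::map<std::string, std::string> discoverAttributes(
    const std::vector<Discoverer>& discoverers)
{
  std::map<std::string, std::string> attributes;
  for (const Discoverer& discover : discoverers) {
    for (const auto& entry : discover()) {
      attributes[entry.first] = entry.second; // Later modules win on conflict.
    }
  }
  return attributes;
}
```

Unlike wrapper scripts, this composes: adding a new probe is just appending one more {{Discoverer}}.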





[jira] [Updated] (MESOS-3326) Make use of C++11 atomics

2015-10-05 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3326:
--
Fix Version/s: (was: 0.25.0)
   0.26.0

> Make use of C++11 atomics
> -
>
> Key: MESOS-3326
> URL: https://issues.apache.org/jira/browse/MESOS-3326
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Neil Conway
> Fix For: 0.26.0
>
>
> Now that we require C++11, we can make use of std::atomic. For example:
> * libprocess/process.cpp uses a bare int + __sync_synchronize() for "running"
> * __sync_synchronize() is used in logging.hpp in libprocess and fork.hpp in 
> stout
> * sched/sched.cpp uses a volatile int for "running" -- this is wrong, 
> "volatile" is not sufficient to ensure safe concurrent access
> * "volatile" is used in a few other places -- most are probably dubious but I 
> haven't looked closely



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3561) Create a user doc for network isolation using per-container IP

2015-10-05 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3561?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3561:
--
Labels: mesosphere  (was: mesosphe)

> Create a user doc for network isolation using per-container IP
> --
>
> Key: MESOS-3561
> URL: https://issues.apache.org/jira/browse/MESOS-3561
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>Priority: Blocker
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3326) Make use of C++11 atomics

2015-10-05 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3326?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3326:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> Make use of C++11 atomics
> -
>
> Key: MESOS-3326
> URL: https://issues.apache.org/jira/browse/MESOS-3326
> Project: Mesos
>  Issue Type: Bug
>Reporter: Neil Conway
>Assignee: Neil Conway
> Fix For: 0.26.0
>
>
> Now that we require C++11, we can make use of std::atomic. For example:
> * libprocess/process.cpp uses a bare int + __sync_synchronize() for "running"
> * __sync_synchronize() is used in logging.hpp in libprocess and fork.hpp in 
> stout
> * sched/sched.cpp uses a volatile int for "running" -- this is wrong, 
> "volatile" is not sufficient to ensure safe concurrent access
> * "volatile" is used in a few other places -- most are probably dubious but I 
> haven't looked closely



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3519) Fix file descriptor leakage / double close in the code base

2015-10-05 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3519?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3519:
--
Fix Version/s: 0.25.0

> Fix file descriptor leakage / double close in the code base
> ---
>
> Key: MESOS-3519
> URL: https://issues.apache.org/jira/browse/MESOS-3519
> Project: Mesos
>  Issue Type: Bug
>Reporter: Chi Zhang
>Assignee: Chi Zhang
> Fix For: 0.25.0
>
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3556) mesos.cli broken in 0.24.x

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14938853#comment-14938853
 ] 

Niklas Quarfot Nielsen commented on MESOS-3556:
---

While not a master at Python, I can help take a look at it with you and help 
land it.

> mesos.cli broken in 0.24.x
> --
>
> Key: MESOS-3556
> URL: https://issues.apache.org/jira/browse/MESOS-3556
> Project: Mesos
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.24.0, 0.24.1
>Reporter: Radoslaw Gruchalski
>Assignee: Marco Massenzio
>  Labels: mesosphere
>
> The issue was initially reported on the mailing list: 
> http://www.mail-archive.com/user@mesos.apache.org/msg04670.html
> The format of the master data stored in zookeeper has changed, but mesos.cli 
> does not reflect these changes, causing tools like {{mesos-tail}} 
> and {{mesos-ps}} to fail.
> Example error from {{mesos-tail}}:
> {noformat}
> mesos-master ~$ mesos tail -f -n 50 service
> Traceback (most recent call last):
>   File "/usr/local/bin/mesos-tail", line 11, in 
> sys.exit(main())
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cli.py", line 61, in 
> wrapper
> return fn(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cmds/tail.py", line 
> 55, in main
> args.task, args.file, fail=(not args.follow)):
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cluster.py", line 
> 27, 
> in files
> tlist = MASTER.tasks(fltr)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 174, 
> in tasks
> self._task_list(active_only
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 153, 
> in _task_list
> *[util.merge(x, *keys) for x in self.frameworks(active_only)])
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 185, 
> in frameworks
> return util.merge(self.state, *keys)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/util.py", line 58, 
> in 
> __get__
> value = self.fget(inst)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 123, 
> in state
> return self.fetch("/master/state.json").json()
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 64, 
> in fetch
> return requests.get(urlparse.urljoin(self.host, url), **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 69, in 
> get
> return request('get', url, params=params, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 50, in 
> request
> response = session.request(method=method, url=url, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 
> 451, 
> in request
> prep = self.prepare_request(req)
>   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 
> 382, 
> in prepare_request
> hooks=merge_hooks(request.hooks, self.hooks),
>   File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 304, 
> in prepare
> self.prepare_url(url, params)
>   File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 357, 
> in prepare_url
> raise InvalidURL(*e.args)
> requests.exceptions.InvalidURL: Failed to parse: 
> 10.100.1.100:5050","port":5050,"version":"0.24.1"}
> {noformat}
> The problem exists in 
> https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/master.py#L107. 
> The code should be along the lines of:
> {noformat}
> try:
>     parsed = json.loads(val)
>     return parsed["address"]["ip"] + ":" + str(parsed["address"]["port"])
> except Exception:
>     return val.split("@")[-1]
> {noformat}
> This causes the master address to come back correctly.
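The suggested fix above can be sketched as a standalone helper that accepts both the legacy `user@host:port` string and the newer JSON encoding of the master info stored in ZooKeeper. The function name `master_address` and the sample values are illustrative, not taken from mesos.cli:

```python
import json


def master_address(val):
    """Extract "ip:port" from the master data stored in ZooKeeper.

    Newer Mesos versions store a JSON blob with an "address" field;
    older versions store a string like "master@10.100.1.100:5050".
    """
    try:
        parsed = json.loads(val)
        return parsed["address"]["ip"] + ":" + str(parsed["address"]["port"])
    except (ValueError, KeyError, TypeError):
        # Not valid JSON (or missing "address"): fall back to the
        # legacy "user@host:port" format and keep everything after '@'.
        return val.split("@")[-1]


# Newer JSON format:
new_val = '{"address": {"ip": "10.100.1.100", "port": 5050}, "version": "0.24.1"}'
print(master_address(new_val))

# Legacy format:
print(master_address("master@10.100.1.100:5050"))
```

Both calls above yield `10.100.1.100:5050`, which matches the behavior the reporter describes as "the master address to come back correctly".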



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3563) Revocable task CPU shows as zero in /state.json

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939021#comment-14939021
 ] 

Niklas Quarfot Nielsen commented on MESOS-3563:
---

Can you share the full state.json?

> Revocable task CPU shows as zero in /state.json
> ---
>
> Key: MESOS-3563
> URL: https://issues.apache.org/jira/browse/MESOS-3563
> Project: Mesos
>  Issue Type: Bug
>Reporter: Maxim Khutornenko
>
> The slave's state.json reports revocable task resources as zero:
> {noformat}
> resources: {
> cpus: 0,
> disk: 3071,
> mem: 1248,
> ports: "[31715-31715]"
> },
> {noformat}
> Also, there is no indication that a task uses revocable CPU. It would be 
> great to have this type of info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3563) Revocable task CPU shows as zero in /state.json

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14939044#comment-14939044
 ] 

Niklas Quarfot Nielsen commented on MESOS-3563:
---

I didn't understand the full context of your problem; I assume you are using the 
fixed resource estimator? In any case, it looks like Vinod is on it.

> Revocable task CPU shows as zero in /state.json
> ---
>
> Key: MESOS-3563
> URL: https://issues.apache.org/jira/browse/MESOS-3563
> Project: Mesos
>  Issue Type: Bug
>Reporter: Maxim Khutornenko
>Assignee: Vinod Kone
>
> The slave's state.json reports revocable task resources as zero:
> {noformat}
> resources: {
> cpus: 0,
> disk: 3071,
> mem: 1248,
> ports: "[31715-31715]"
> },
> {noformat}
> Also, there is no indication that a task uses revocable CPU. It would be 
> great to have this type of info.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3556) mesos.cli broken in 0.24.x

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3556?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937442#comment-14937442
 ] 

Niklas Quarfot Nielsen commented on MESOS-3556:
---

[~marco-mesos] Do you have/need a shepherd for this?

> mesos.cli broken in 0.24.x
> --
>
> Key: MESOS-3556
> URL: https://issues.apache.org/jira/browse/MESOS-3556
> Project: Mesos
>  Issue Type: Bug
>  Components: cli
>Affects Versions: 0.24.0, 0.24.1
>Reporter: Radoslaw Gruchalski
>Assignee: Marco Massenzio
>  Labels: mesosphere
>
> The issue was initially reported on the mailing list: 
> http://www.mail-archive.com/user@mesos.apache.org/msg04670.html
> The format of the master data stored in zookeeper has changed, but mesos.cli 
> does not reflect these changes, causing tools like {{mesos-tail}} 
> and {{mesos-ps}} to fail.
> Example error from {{mesos-tail}}:
> {noformat}
> mesos-master ~$ mesos tail -f -n 50 service
> Traceback (most recent call last):
>   File "/usr/local/bin/mesos-tail", line 11, in 
> sys.exit(main())
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cli.py", line 61, in 
> wrapper
> return fn(*args, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cmds/tail.py", line 
> 55, in main
> args.task, args.file, fail=(not args.follow)):
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/cluster.py", line 
> 27, 
> in files
> tlist = MASTER.tasks(fltr)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 174, 
> in tasks
> self._task_list(active_only
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 153, 
> in _task_list
> *[util.merge(x, *keys) for x in self.frameworks(active_only)])
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 185, 
> in frameworks
> return util.merge(self.state, *keys)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/util.py", line 58, 
> in 
> __get__
> value = self.fget(inst)
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 
> 123, 
> in state
> return self.fetch("/master/state.json").json()
>   File "/usr/local/lib/python2.7/dist-packages/mesos/cli/master.py", line 64, 
> in fetch
> return requests.get(urlparse.urljoin(self.host, url), **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 69, in 
> get
> return request('get', url, params=params, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 50, in 
> request
> response = session.request(method=method, url=url, **kwargs)
>   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 
> 451, 
> in request
> prep = self.prepare_request(req)
>   File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 
> 382, 
> in prepare_request
> hooks=merge_hooks(request.hooks, self.hooks),
>   File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 304, 
> in prepare
> self.prepare_url(url, params)
>   File "/usr/local/lib/python2.7/dist-packages/requests/models.py", line 357, 
> in prepare_url
> raise InvalidURL(*e.args)
> requests.exceptions.InvalidURL: Failed to parse: 
> 10.100.1.100:5050","port":5050,"version":"0.24.1"}
> {noformat}
> The problem exists in 
> https://github.com/mesosphere/mesos-cli/blob/master/mesos/cli/master.py#L107. 
> The code should be along the lines of:
> {noformat}
> try:
>     parsed = json.loads(val)
>     return parsed["address"]["ip"] + ":" + str(parsed["address"]["port"])
> except Exception:
>     return val.split("@")[-1]
> {noformat}
> This causes the master address to come back correctly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3554) Allocator changes trigger large re-compiles

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937445#comment-14937445
 ] 

Niklas Quarfot Nielsen commented on MESOS-3554:
---

Do you need a shepherd for this?

> Allocator changes trigger large re-compiles
> ---
>
> Key: MESOS-3554
> URL: https://issues.apache.org/jira/browse/MESOS-3554
> Project: Mesos
>  Issue Type: Improvement
>  Components: allocation
>Reporter: Joris Van Remoortere
>Assignee: Joris Van Remoortere
>  Labels: mesosphere
>
> Due to the templatized nature of the allocator, even small changes trigger 
> large recompiles of the code-base. This makes iterating on changes expensive 
> for developers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-1615) Create design document for Optimistic Offers

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1615:
--
Shepherd: Joris Van Remoortere

> Create design document for Optimistic Offers
> 
>
> Key: MESOS-1615
> URL: https://issues.apache.org/jira/browse/MESOS-1615
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Dominic Hamon
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> As a first step toward Optimistic Offers, take the description from the epic 
> and build an implementation design doc that can be shared for comments.
> Note: the links to the working group notes and design doc are located in the 
> [JIRA Epic|MESOS-1607].



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3515) Support Subscribe Call for HTTP based Executors

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937447#comment-14937447
 ] 

Niklas Quarfot Nielsen commented on MESOS-3515:
---

Hi [~anandmazumdar] - have you found a shepherd for this?

> Support Subscribe Call for HTTP based Executors
> ---
>
> Key: MESOS-3515
> URL: https://issues.apache.org/jira/browse/MESOS-3515
> Project: Mesos
>  Issue Type: Task
>Reporter: Anand Mazumdar
>Assignee: Anand Mazumdar
>  Labels: mesosphere
>
> We need to add a {{subscribe(...)}} method in {{src/slave/slave.cpp}} to 
> introduce the ability for HTTP based executors to subscribe and then receive 
> events on the persistent HTTP connection. Most of the functionality needed 
> would be similar to {{Master::subscribe}} in {{src/master/master.cpp}}.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3183) Documentation images do not load

2015-09-30 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14937451#comment-14937451
 ] 

Niklas Quarfot Nielsen commented on MESOS-3183:
---

[~davelester] Can you be the shepherd for this?

> Documentation images do not load
> 
>
> Key: MESOS-3183
> URL: https://issues.apache.org/jira/browse/MESOS-3183
> Project: Mesos
>  Issue Type: Documentation
>  Components: documentation
>Affects Versions: 0.24.0
>Reporter: James Mulcahy
>Assignee: Joseph Wu
>Priority: Minor
>  Labels: mesosphere
> Attachments: rake.patch
>
>
> Any images which are referenced from the generated docs ({{docs/*.md}}) do 
> not show up on the website.  For example:
> * [External 
> Containerizer|http://mesos.apache.org/documentation/latest/external-containerizer/]
> * [Fetcher Cache 
> Internals|http://mesos.apache.org/documentation/latest/fetcher-cache-internals/]
> * [Maintenance|http://mesos.apache.org/documentation/latest/maintenance/] 
> * 
> [Oversubscription|http://mesos.apache.org/documentation/latest/oversubscription/]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3516) Add user doc for networking support in Mesos 0.25.0

2015-09-28 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3516:
--
Target Version/s: 0.25.0

> Add user doc for networking support in Mesos 0.25.0
> ---
>
> Key: MESOS-3516
> URL: https://issues.apache.org/jira/browse/MESOS-3516
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Niklas Quarfot Nielsen
>Assignee: Kapil Arya
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3532) 3 Master HA setup restarts every 3 minutes

2015-09-28 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3532?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14933512#comment-14933512
 ] 

Niklas Quarfot Nielsen commented on MESOS-3532:
---

@jieyu - does this look familiar?

[~edonahue3rd] Can you share your quorum configuration for the masters?

> 3 Master HA setup restarts every 3 minutes
> --
>
> Key: MESOS-3532
> URL: https://issues.apache.org/jira/browse/MESOS-3532
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.23.0
>Reporter: Edward Donahue III
>
> CentOS 7.1, 3 Node cluster, each host has mesos master/slave and zookeeper 
> setup.
> After I pushed out a bad zoo.cfg (it added 2 extra zookeeper hosts that didn't 
> exist), the elected master restarts about every three minutes, and this keeps 
> happening; even when I have just one of the three masters running, it restarts 
> every 3 minutes.
> I fixed the configs and deleted all the files under 
> /var/log/zookeeper/version-2/ and /var/lib/zookeeper/version-2/. Is there 
> another step I need to take? I feel like zookeeper is the issue (it's also where 
> I lack knowledge); this cluster was stable for months until I pushed out the bad 
> zoo.cfg.
> The master logs have this output every second:
> I0928 13:56:05.281518 28448 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0928 13:56:05.351608 28450 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0928 13:56:05.351794 28448 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0928 13:56:05.352700 28452 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0928 13:56:05.352963 28447 recover.cpp:195] Received a recover response from 
> a replica in VOTING status
> The mesos-slaves don't even register in time:
> I0928 13:55:40.041491 28418 slave.cpp:3087] master@10.251.132.179:5050 exited
> W0928 13:55:40.041574 28418 slave.cpp:3090] Master disconnected! Waiting for 
> a new master to be elected
> E0928 13:55:40.250059 28420 socket.hpp:107] Shutdown failed on fd=9: 
> Transport endpoint is not connected [107]
> I0928 13:55:48.005607 28418 detector.cpp:138] Detected a new leader: (id='14')
> I0928 13:55:48.005836 28417 group.cpp:656] Trying to get 
> '/mesos/info_14' in ZooKeeper
> W0928 13:55:48.006597 28417 detector.cpp:444] Leading master 
> master@10.251.132.177:5050 is using a Protobuf binary f...ESOS-2340)
> I0928 13:55:48.006652 28417 detector.cpp:481] A new leading master 
> (UPID=master@10.251.132.177:5050) is detected
> I0928 13:55:48.006731 28417 slave.cpp:684] New master detected at 
> master@10.251.132.177:5050
> I0928 13:55:48.006891 28417 slave.cpp:709] No credentials provided. 
> Attempting to register without authentication
> I0928 13:55:48.006911 28417 slave.cpp:720] Detecting new master
> I0928 13:55:48.006940 28417 status_update_manager.cpp:176] Pausing sending 
> status updates



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3123) DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged fails & crashes

2015-09-28 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934166#comment-14934166
 ] 

Niklas Quarfot Nielsen edited comment on MESOS-3123 at 9/28/15 10:08 PM:
-

Just ran into this during testing of Mesos 0.25.0 rc1 on Ubuntu 14.04

{code}
[ RUN  ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor
2015-09-28 
22:00:14,166:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:17,504:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:20,841:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:24,178:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
../../mesos/src/tests/containerizer/docker_containerizer_tests.cpp:254: Failure
Value of: statusRunning.get().state()
  Actual: TASK_FAILED
Expected: TASK_RUNNING
2015-09-28 
22:00:27,515:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:30,851:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:34,188:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:37,526:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:40,863:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:44,208:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:47,546:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:50,884:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:54,222:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:57,560:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:00,899:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:04,238:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:07,575:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:10,912:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:14,249:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:17,587:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:20,925:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:24,264:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
../../mesos/src/tests/containerizer/docker_containerizer_tests.cpp:255: Failure
Failed to wait 

[jira] [Updated] (MESOS-3516) Add user doc for networking support in Mesos 0.25.0

2015-09-28 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3516:
--
Story Points: 2

> Add user doc for networking support in Mesos 0.25.0
> ---
>
> Key: MESOS-3516
> URL: https://issues.apache.org/jira/browse/MESOS-3516
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Niklas Quarfot Nielsen
>Assignee: Niklas Quarfot Nielsen
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3123) DockerContainerizerTest.ROOT_DOCKER_Launch_Executor_Bridged fails & crashes

2015-09-28 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3123?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934166#comment-14934166
 ] 

Niklas Quarfot Nielsen commented on MESOS-3123:
---

{code}
[ RUN  ] DockerContainerizerTest.ROOT_DOCKER_Launch_Executor
2015-09-28 
22:00:14,166:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:17,504:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:20,841:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:24,178:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
../../mesos/src/tests/containerizer/docker_containerizer_tests.cpp:254: Failure
Value of: statusRunning.get().state()
  Actual: TASK_FAILED
Expected: TASK_RUNNING
2015-09-28 
22:00:27,515:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:30,851:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:34,188:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:37,526:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:40,863:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:44,208:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:47,546:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:50,884:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:54,222:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:00:57,560:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:00,899:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:04,238:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:07,575:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:10,912:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:14,249:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:17,587:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:20,925:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
2015-09-28 
22:01:24,264:7267(0x2ba9fb511700):ZOO_ERROR@handle_socket_error_msg@1697: 
Socket [127.0.0.1:53630] zk retcode=-4, errno=111(Connection refused): server 
refused to accept the client
../../mesos/src/tests/containerizer/docker_containerizer_tests.cpp:255: Failure
Failed to wait 1mins for statusFinished
../../mesos/src/tests/containerizer/docker_containerizer_tests.cpp:246: Failure
Actual function 

[jira] [Created] (MESOS-3538) CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy test is flaky

2015-09-28 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-3538:
-

 Summary: 
CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy test is 
flaky
 Key: MESOS-3538
 URL: https://issues.apache.org/jira/browse/MESOS-3538
 Project: Mesos
  Issue Type: Bug
Reporter: Niklas Quarfot Nielsen
Priority: Blocker


{code}
$ sudo ./bin/mesos-tests.sh 
--gtest_filter="CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy"
 
Source directory: /home/vagrant/mesos
Build directory: /home/vagrant/mesos-build
-
We cannot run any cgroups tests that require mounting
hierarchies because you have the following hierarchies mounted:
/sys/fs/cgroup/blkio, /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, 
/sys/fs/cgroup/cpuset, /sys/fs/cgroup/devices, /sys/fs/cgroup/freezer, 
/sys/fs/cgroup/hugetlb, /sys/fs/cgroup/memory, /sys/fs/cgroup/perf_event, 
/sys/fs/cgroup/systemd
We'll disable the CgroupsNoHierarchyTest test fixture for now.
-
sh: 1: perf: not found
-
No 'perf' command found so no 'perf' tests will be run
-
/bin/nc
Note: Google Test filter = 
CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy-MesosContainerizerSlaveRecoveryTest.CGROUPS_ROOT_PerfRollForward:PerfEventIsolatorTest.ROOT_CGROUPS_Sample:UserCgroupIsolatorTest/2.ROOT_CGROUPS_UserCgroup:CgroupsAnyHierarchyWithPerfEventTest.ROOT_CGROUPS_Perf:PerfTest.ROOT_Events:PerfTest.ROOT_Sample:PerfTest.Parse:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/0:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/1:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/2:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/3:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/4:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/5:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/6:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/7:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/8:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/9:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/10:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/11:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/12:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/13:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/14:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/15:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/16:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/17:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/18:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/19:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMA
RK_Test.AddAndUpdateSlave/20:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/21:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/22:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/23:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/24:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/25:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/26:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/27:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/28:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/29:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/30:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/31:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/32:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/33:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/34:SlaveAndFrameworkCount/HierarchicalAllocator_BENCHMARK_Test.AddAndUpdateSlave/35:SlaveCount/Registrar_BENCHMARK_Test.Performance/0:SlaveCount/Registrar_BENCHMARK_Test.Performance/1:SlaveCount/Registrar_BENCHMARK_Test.Performance/2:SlaveCount/Registrar_BENCHMARK_Test.Performance/3
[==] Running 1 test from 1 test case.
[--] Global test environment set-up.
[--] 1 test from 

[jira] [Commented] (MESOS-3538) CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy test is flaky

2015-09-28 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14934219#comment-14934219
 ] 

Niklas Quarfot Nielsen commented on MESOS-3538:
---

Thanks Jie! I will rerun the test and see if that solves the problem (and close 
the ticket if everything is OK).

> CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy test is 
> flaky
> ---
>
> Key: MESOS-3538
> URL: https://issues.apache.org/jira/browse/MESOS-3538
> Project: Mesos
>  Issue Type: Bug
>Reporter: Niklas Quarfot Nielsen
>Priority: Blocker
>
> {code}
> $ sudo ./bin/mesos-tests.sh 
> --gtest_filter="CgroupsNoHierarchyTest.ROOT_CGROUPS_NOHIERARCHY_MountUnmountHierarchy"
>  
> Source directory: /home/vagrant/mesos
> Build directory: /home/vagrant/mesos-build
> -
> We cannot run any cgroups tests that require mounting
> hierarchies because you have the following hierarchies mounted:
> /sys/fs/cgroup/blkio, /sys/fs/cgroup/cpu, /sys/fs/cgroup/cpuacct, 
> /sys/fs/cgroup/cpuset, /sys/fs/cgroup/devices, /sys/fs/cgroup/freezer, 
> /sys/fs/cgroup/hugetlb, /sys/fs/cgroup/memory, /sys/fs/cgroup/perf_event, 
> /sys/fs/cgroup/systemd
> We'll disable the CgroupsNoHierarchyTest test fixture for now.
> -
> sh: 1: perf: not found
> -
> No 'perf' command found so no 'perf' tests will be run
> -
> /bin/nc
> Note: Google Test filter = 
> 

[jira] [Commented] (MESOS-3493) benchmark for declining offers

2015-09-25 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3493?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14908369#comment-14908369
 ] 

Niklas Quarfot Nielsen commented on MESOS-3493:
---

Hey James!

Let's get this on a committer's radar (and get a shepherd assigned to the issue 
and reviewers added to your reviews).

[~jieyu] [~mcypark] - Would you be up for it? I can help too, but I think you 
have more experience in the allocator. Just let me know.

> benchmark for declining offers
> --
>
> Key: MESOS-3493
> URL: https://issues.apache.org/jira/browse/MESOS-3493
> Project: Mesos
>  Issue Type: Improvement
>  Components: test
>Reporter: James Peach
>Priority: Minor
>
> I wrote a benchmark that can be used to demonstrate the performance issues 
> addressed in MESOS-3052, MESOS-3051, MESOS-3157 and MESOS-3075. The benchmark 
> simulates a number of frameworks that start declining all offers once they 
> reach the limit of work they need to do.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-3516) Add user doc for networking support in Mesos 0.25.0

2015-09-24 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-3516:
-

 Summary: Add user doc for networking support in Mesos 0.25.0
 Key: MESOS-3516
 URL: https://issues.apache.org/jira/browse/MESOS-3516
 Project: Mesos
  Issue Type: Documentation
Reporter: Niklas Quarfot Nielsen
Assignee: Kapil Arya








[jira] [Updated] (MESOS-3282) Web UI no longer shows Tasks information

2015-09-23 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3282?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3282:
--
Target Version/s: 0.25.0

> Web UI no longer shows Tasks information
> 
>
> Key: MESOS-3282
> URL: https://issues.apache.org/jira/browse/MESOS-3282
> Project: Mesos
>  Issue Type: Bug
>  Components: webui
>Affects Versions: 0.23.0
>Reporter: Diogo Gomes
>Assignee: haosdent
> Attachments: task_infomartions.png
>
>
> After updating Mesos to 0.23, the Tasks box no longer shows info. Reading the 
> code, it seems like it depends on the state.json endpoint. In 0.22.1, it's 
> possible to see data like "failed_tasks" and "finished_tasks", which is no 
> longer present in the 0.23.0 state.json, as is required by 
> https://github.com/apache/mesos/blob/a0811310c82ee25644fc9a6362313ce3619e46d9/src/webui/master/static/js/controllers.js#L126





[jira] [Updated] (MESOS-3498) Failed to create a containerizer

2015-09-23 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3498?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3498:
--
Target Version/s: 0.25.0

> Failed to create a containerizer
> 
>
> Key: MESOS-3498
> URL: https://issues.apache.org/jira/browse/MESOS-3498
> Project: Mesos
>  Issue Type: Bug
>  Components: docker
>Affects Versions: 0.25.0
> Environment: Docker 1.8.2
> Ubuntu 14.04.1 LTS
> User: root
> Linux li202-122 4.1.5-x86_64-linode61 #7 SMP Mon Aug 24 13:46:31 EDT 2015 
> x86_64 x86_64 x86_64 GNU/Linux
>Reporter: Rafael Capucho
>Assignee: Kapil Arya
>Priority: Blocker
>
> I'm using this script to compile Mesos [1], and it's working properly; I can 
> deploy the Mesos master without issues.
> [1] - 
> https://bitbucket.org/rafaelcapucho/docker-mesos-master/src/b8709d3cbe52255f8eb5df17f79abff6b1945e95/Dockerfile?at=master=file-view-default
> The problem occurs when I deploy the Mesos slave using the same script [1], 
> running:
> docker run rafa/docker-mesos-master mesos-slave --logging_level='INFO' 
> --log_dir=/var/log/mesos --master="zk://173.255.192.122:2181/mesos"
> It exits after a couple of seconds; the log:
> root@li202-122:~# docker run -e "MESOS_LOG_DIR=/var/log/mesos" 
> rafa/docker-mesos-master mesos-slave --logging_level='INFO' 
> --log_dir=/var/log/mesos --master="zk://173.255.192.122:2181/mesos"
> I0923 00:28:11.621003 1 logging.cpp:172] INFO level logging started!
> I0923 00:28:11.621363 1 main.cpp:185] Build: 2015-09-22 23:39:14 by 
> I0923 00:28:11.621389 1 main.cpp:187] Version: 0.25.0
> I0923 00:28:11.621397 1 main.cpp:194] Git SHA: 
> f5ec1d006794ef906e8b56861aa771888a73702f
> I0923 00:28:11.622469 1 containerizer.cpp:143] Using isolation: 
> posix/cpu,posix/mem,filesystem/posix
> Failed to create a containerizer: Could not create MesosContainerizer: Failed 
> to create launcher: Failed to create Linux launcher: Failed to create root 
> cgroup /sys/fs/cgroup/freezer/mesos: Failed to create directory 
> '/sys/fs/cgroup/freezer/mesos': Read-only file system





[jira] [Commented] (MESOS-2642) Provide a way for frameworks to report their SLO per executor or task.

2015-09-23 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14904975#comment-14904975
 ] 

Niklas Quarfot Nielsen commented on MESOS-2642:
---

Draft document for the design: 
https://docs.google.com/document/d/1EfBESIeqJvj_hVGz_oIVHDnJiQtFASKwrtAIQgL0x0k/edit#heading=h.pldb4jdil4u1

Major upcoming changes include the notion of an SLI (Service Level Indicator) 
vs. an SLO (the target; Service Level Objective).

We see two themes:
 - How to produce/export the SLI and SLO from the framework and its tasks
 - How to consume them from the oversubscription modules and external 
monitoring solutions

We have highlighted a few ways of doing this and will follow up with a more 
solidified architecture proposal soon (in the same document). Stay tuned.

> Provide a way for frameworks to report their SLO per executor or task.
> --
>
> Key: MESOS-2642
> URL: https://issues.apache.org/jira/browse/MESOS-2642
> Project: Mesos
>  Issue Type: Story
>Reporter: Niklas Quarfot Nielsen
>
> In context of oversubscription, but useful as a general feature; allowing 
> frameworks to report their SLO (to which degree, if it is violated, etc) 
> allows for better monitoring, scheduling and more aggressive oversubscription 
> strategies.





[jira] [Updated] (MESOS-3451) Failing tests after changes to Isolator/MesosContainerizer API

2015-09-23 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3451:
--
Shepherd: Jie Yu  (was: Niklas Quarfot Nielsen)

> Failing tests after changes to Isolator/MesosContainerizer API
> --
>
> Key: MESOS-3451
> URL: https://issues.apache.org/jira/browse/MESOS-3451
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>Priority: Blocker
>
> The failures are related to the following recent commits:
> e047f7d69b5297cc787487b6093119a3be517e48
> fc541a9a97eb1d86c27452019ff217eed11ed5a3
> 6923bb3e8cfbddde9fbabc6ca4edc29d9fc96c06





[jira] [Commented] (MESOS-3489) Add support for exposing Accept/Decline responses for inverse offers

2015-09-23 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14905195#comment-14905195
 ] 

Niklas Quarfot Nielsen commented on MESOS-3489:
---

Awaiting review/shipit from [~jvanremoortere]

> Add support for exposing Accept/Decline responses for inverse offers
> 
>
> Key: MESOS-3489
> URL: https://issues.apache.org/jira/browse/MESOS-3489
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> The current implementation of maintenance primitives does not support exposing 
> frameworks' Accept/Decline responses to cluster operators. 
> This functionality is necessary to provide visibility to operators into 
> whether a given framework is ready to comply with the posted maintenance 
> schedule.





[jira] [Created] (MESOS-3507) As an operator, I want a way to inspect queued tasks in running schedulers

2015-09-23 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-3507:
-

 Summary: As an operator, I want a way to inspect queued tasks in 
running schedulers
 Key: MESOS-3507
 URL: https://issues.apache.org/jira/browse/MESOS-3507
 Project: Mesos
  Issue Type: Story
Reporter: Niklas Quarfot Nielsen


Currently, there is no uniform way of getting a notion of 'awaiting' tasks, i.e., 
expressing that a framework has more work to do. This information is useful for 
auto-scaling and anomaly-detection systems. Schedulers tend to expose this over 
their own HTTP endpoints, but the formats across schedulers are most likely not 
compatible.





[jira] [Commented] (MESOS-3051) performance issues with port ranges comparison

2015-09-22 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3051?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903605#comment-14903605
 ] 

Niklas Quarfot Nielsen commented on MESOS-3051:
---

[~tillt] [~js84] - Will you guys have this landed by tomorrow EOD?

> performance issues with port ranges comparison
> --
>
> Key: MESOS-3051
> URL: https://issues.apache.org/jira/browse/MESOS-3051
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Affects Versions: 0.22.1
>Reporter: James Peach
>Assignee: Joerg Schad
>  Labels: mesosphere
>
> This was tested in an environment with many frameworks (>200), where the 
> frameworks permanently decline resources they don't need. The allocator ends 
> up spending a lot of time figuring out whether offers are refused (the code 
> path through {{HierarchicalAllocatorProcess::isFiltered()}}).
> In profiling a synthetic benchmark, it turns out that comparing port ranges 
> is very expensive, involving many temporary allocations. 61% of 
> Resources::contains() run time is in operator -= (Resource). 35% of 
> Resources::contains() run time is in Resources::_contains().
> The heaviest call chain through {{Resources::_contains}} is:
> {code}
> Running Time  Self (ms) Symbol Name
> 7237.0ms   35.5%  4.0
> mesos::Resources::_contains(mesos::Resource const&) const
> 7200.0ms   35.3%  1.0 mesos::contains(mesos::Resource 
> const&, mesos::Resource const&)
> 7133.0ms   35.0%121.0  
> mesos::operator<=(mesos::Value_Ranges const&, mesos::Value_Ranges const&)
> 6319.0ms   31.0%  7.0   
> mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Ranges const&)
> 6240.0ms   30.6%161.0
> mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Range const&)
> 1867.0ms9.1% 25.0 mesos::Value_Ranges::add_range()
> 1694.0ms8.3%  4.0 
> mesos::Value_Ranges::~Value_Ranges()
> 1495.0ms7.3% 16.0 
> mesos::Value_Ranges::operator=(mesos::Value_Ranges const&)
>  445.0ms2.1% 94.0 
> mesos::Value_Range::MergeFrom(mesos::Value_Range const&)
>  154.0ms0.7% 24.0 mesos::Value_Ranges::range(int) 
> const
>  103.0ms0.5% 24.0 
> mesos::Value_Ranges::range_size() const
>   95.0ms0.4%  2.0 
> mesos::Value_Range::Value_Range(mesos::Value_Range const&)
>   59.0ms0.2%  4.0 
> mesos::Value_Ranges::Value_Ranges()
>   50.0ms0.2% 50.0 mesos::Value_Range::begin() 
> const
>   28.0ms0.1% 28.0 mesos::Value_Range::end() const
>   26.0ms0.1%  0.0 
> mesos::Value_Range::~Value_Range()
> {code}
> mesos::coalesce(Value_Ranges) gets done a lot and ends up being really 
> expensive. The heaviest parts of the inverted call chain are:
> {code}
> Running Time  Self (ms)   Symbol Name
> 3209.0ms   15.7%  3209.0  mesos::Value_Range::~Value_Range()
> 3209.0ms   15.7%  0.0  
> google::protobuf::internal::GenericTypeHandler::Delete(mesos::Value_Range*)
> 3209.0ms   15.7%  0.0   void 
> google::protobuf::internal::RepeatedPtrFieldBase::Destroy()
> 3209.0ms   15.7%  0.0
> google::protobuf::RepeatedPtrField::~RepeatedPtrField()
> 3209.0ms   15.7%  0.0 
> google::protobuf::RepeatedPtrField::~RepeatedPtrField()
> 3209.0ms   15.7%  0.0  
> mesos::Value_Ranges::~Value_Ranges()
> 3209.0ms   15.7%  0.0   
> mesos::Value_Ranges::~Value_Ranges()
> 2441.0ms   11.9%  0.0
> mesos::coalesce(mesos::Value_Ranges*, mesos::Value_Range const&)
>  452.0ms2.2%  0.0
> mesos::remove(mesos::Value_Ranges*, mesos::Value_Range const&)
>  169.0ms0.8%  0.0
> mesos::operator<=(mesos::Value_Ranges const&, mesos::Value_Ranges const&)
>   82.0ms0.4%  0.0
> mesos::operator-=(mesos::Value_Ranges&, mesos::Value_Ranges const&)
>   65.0ms0.3%  0.0
> mesos::Value_Ranges::~Value_Ranges()
> 2541.0ms   12.4%  2541.0  
> google::protobuf::internal::GenericTypeHandler::New()
> 2541.0ms   12.4%  0.0  
> google::protobuf::RepeatedPtrField::TypeHandler::Type* 
> google::protobuf::internal::RepeatedPtrFieldBase::Add()
> 2305.0ms   11.3%  0.0   
> google::protobuf::RepeatedPtrField::Add()
> 2305.0ms   11.3%  0.0mesos::Value_Ranges::add_range()
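
The profile above shows {{coalesce()}} dominated by temporary Value_Ranges 
allocations and protobuf churn. As an illustration only (an assumption, not the 
actual Mesos fix), inclusive port ranges can be coalesced in place, with no 
temporary containers:

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Illustrative sketch: coalesce inclusive [begin, end] port ranges
// in place, avoiding the temporary allocations seen in the profile.
using Range = std::pair<int, int>;

void coalesce(std::vector<Range>* ranges) {
  if (ranges->empty()) {
    return;
  }
  std::sort(ranges->begin(), ranges->end());
  size_t out = 0;  // Index of the last merged range.
  for (size_t i = 1; i < ranges->size(); ++i) {
    Range& last = (*ranges)[out];
    const Range& current = (*ranges)[i];
    if (current.first <= last.second + 1) {
      // Overlapping or adjacent: extend the previous range in place.
      last.second = std::max(last.second, current.second);
    } else {
      (*ranges)[++out] = current;
    }
  }
  ranges->resize(out + 1);
}
```

For example, {[31000, 31005], [31003, 31010], [31012, 31015]} coalesces to 
{[31000, 31010], [31012, 31015]} without allocating intermediate range sets.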

[jira] [Commented] (MESOS-3457) Add flag to disable hostname lookup

2015-09-22 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3457?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903601#comment-14903601
 ] 

Niklas Quarfot Nielsen commented on MESOS-3457:
---

[~benjaminhindman] Will you get to take a look/shepherd this change by tomorrow 
EOD? If not, should we find another shepherd?

> Add flag to disable hostname lookup
> ---
>
> Key: MESOS-3457
> URL: https://issues.apache.org/jira/browse/MESOS-3457
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Cody Maloney
>Assignee: Marco Massenzio
>  Labels: mesosphere
>
> In testing/building DCOS, we've found that we need to set --hostname 
> explicitly on the masters. For our uses, the IP and `hostname` must always be 
> the same thing. 
> More generally, under certain circumstances, dynamic lookup of {{hostname}}, 
> while successful, provides undesirable results; we would also like, in those 
> circumstances, to be able to just set the hostname to the chosen 
> IP address (possibly set via the {{\-\-ip_discovery_command}} method).
> We suggest adding a {{\-\-no-hostname-lookup}} flag. 
> Note that we can introduce this flag as {{--hostname-lookup}} with a default 
> to 'true' (which is the current semantics) and that way someone can do 
> {{\-\-no-hostname-lookup}} or {{\-\-hostname-lookup=false}}.
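
The proposed semantics (default true, with both a negated alias and an explicit 
{{=false}} form) can be sketched as follows; {{hostnameLookup}} is an 
illustrative helper, not stout's actual flag parser:

```cpp
#include <cassert>
#include <string>

// Illustration only (not stout's actual flag parsing): a boolean flag
// that defaults to true and accepts both the negated alias and an
// explicit "=false" form, as suggested in the ticket.
bool hostnameLookup(const std::string& arg) {
  if (arg == "--no-hostname-lookup" || arg == "--hostname-lookup=false") {
    return false;
  }
  // "--hostname-lookup", "--hostname-lookup=true", or flag absent.
  return true;
}
```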





[jira] [Commented] (MESOS-3489) Add support for exposing Accept/Decline responses for inverse offers

2015-09-22 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903594#comment-14903594
 ] 

Niklas Quarfot Nielsen commented on MESOS-3489:
---

Will this land in time for 0.25.0? If not, let's bump it to 0.26.0.

> Add support for exposing Accept/Decline responses for inverse offers
> 
>
> Key: MESOS-3489
> URL: https://issues.apache.org/jira/browse/MESOS-3489
> Project: Mesos
>  Issue Type: Bug
>Reporter: Artem Harutyunyan
>Assignee: Joseph Wu
>  Labels: mesosphere
>
> The current implementation of maintenance primitives does not support exposing 
> frameworks' Accept/Decline responses to cluster operators. 
> This functionality is necessary to provide visibility to operators into 
> whether a given framework is ready to comply with the posted maintenance 
> schedule.





[jira] [Commented] (MESOS-3366) Allow resources/attributes discovery

2015-09-22 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14903715#comment-14903715
 ] 

Niklas Quarfot Nielsen commented on MESOS-3366:
---

Submitted

commit b0d1c6ea0b3f6ff17d4e947a5bf0258a649a8f65
Author: Felix Abecassis 
Date:   Tue Sep 22 13:57:58 2015 -0700

Made attributes.hpp public.

This is required in order to enable callback hooks that can modify
attributes of a slave during initialization.

Review: https://reviews.apache.org/r/38517

> Allow resources/attributes discovery
> 
>
> Key: MESOS-3366
> URL: https://issues.apache.org/jira/browse/MESOS-3366
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Felix Abecassis
>
> In heterogeneous clusters, tasks sometimes have strong constraints on the 
> type of hardware they need to execute on. The current solution is to use 
> custom resources and attributes on the agents. Detecting non-standard 
> resources/attributes requires wrapping the "mesos-slave" binary behind a 
> script and using custom code to probe the agent. Unfortunately, this approach 
> doesn't allow composition. The solution would be to provide a hook/module 
> mechanism to allow users to use custom code performing resources/attributes 
> discovery.
> Please review the detailed document below:
> https://docs.google.com/document/d/15OkebDezFxzeyLsyQoU0upB0eoVECAlzEkeg0HQAX9w
> Feel free to express comments/concerns by annotating the document or by 
> replying to this issue.





[jira] [Commented] (MESOS-2226) HookTest.VerifySlaveLaunchExecutorHook is flaky

2015-09-21 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14901552#comment-14901552
 ] 

Niklas Quarfot Nielsen commented on MESOS-2226:
---

https://reviews.apache.org/r/38574/

> HookTest.VerifySlaveLaunchExecutorHook is flaky
> ---
>
> Key: MESOS-2226
> URL: https://issues.apache.org/jira/browse/MESOS-2226
> Project: Mesos
>  Issue Type: Bug
>  Components: test
>Affects Versions: 0.22.0
>Reporter: Vinod Kone
>Assignee: Kapil Arya
>  Labels: flaky, flaky-test, mesosphere
>
> Observed this on internal CI
> {code}
> [ RUN  ] HookTest.VerifySlaveLaunchExecutorHook
> Using temporary directory '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME'
> I0114 18:51:34.659353  4720 leveldb.cpp:176] Opened db in 1.255951ms
> I0114 18:51:34.662112  4720 leveldb.cpp:183] Compacted db in 596090ns
> I0114 18:51:34.662364  4720 leveldb.cpp:198] Created db iterator in 177877ns
> I0114 18:51:34.662719  4720 leveldb.cpp:204] Seeked to beginning of db in 
> 19709ns
> I0114 18:51:34.663010  4720 leveldb.cpp:273] Iterated through 0 keys in the 
> db in 18208ns
> I0114 18:51:34.663312  4720 replica.cpp:744] Replica recovered with log 
> positions 0 -> 0 with 1 holes and 0 unlearned
> I0114 18:51:34.664266  4735 recover.cpp:449] Starting replica recovery
> I0114 18:51:34.664908  4735 recover.cpp:475] Replica is in EMPTY status
> I0114 18:51:34.667842  4734 replica.cpp:641] Replica in EMPTY status received 
> a broadcasted recover request
> I0114 18:51:34.669117  4735 recover.cpp:195] Received a recover response from 
> a replica in EMPTY status
> I0114 18:51:34.677913  4735 recover.cpp:566] Updating replica status to 
> STARTING
> I0114 18:51:34.683157  4735 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 137939ns
> I0114 18:51:34.683507  4735 replica.cpp:323] Persisted replica status to 
> STARTING
> I0114 18:51:34.684013  4735 recover.cpp:475] Replica is in STARTING status
> I0114 18:51:34.685554  4738 replica.cpp:641] Replica in STARTING status 
> received a broadcasted recover request
> I0114 18:51:34.696512  4736 recover.cpp:195] Received a recover response from 
> a replica in STARTING status
> I0114 18:51:34.700552  4735 recover.cpp:566] Updating replica status to VOTING
> I0114 18:51:34.701128  4735 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 115624ns
> I0114 18:51:34.701478  4735 replica.cpp:323] Persisted replica status to 
> VOTING
> I0114 18:51:34.701817  4735 recover.cpp:580] Successfully joined the Paxos 
> group
> I0114 18:51:34.702569  4735 recover.cpp:464] Recover process terminated
> I0114 18:51:34.716439  4736 master.cpp:262] Master 
> 20150114-185134-2272962752-57018-4720 (fedora-19) started on 
> 192.168.122.135:57018
> I0114 18:51:34.716913  4736 master.cpp:308] Master only allowing 
> authenticated frameworks to register
> I0114 18:51:34.717136  4736 master.cpp:313] Master only allowing 
> authenticated slaves to register
> I0114 18:51:34.717488  4736 credentials.hpp:36] Loading credentials for 
> authentication from 
> '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME/credentials'
> I0114 18:51:34.718077  4736 master.cpp:357] Authorization enabled
> I0114 18:51:34.719238  4738 whitelist_watcher.cpp:65] No whitelist given
> I0114 18:51:34.719755  4737 hierarchical_allocator_process.hpp:285] 
> Initialized hierarchical allocator process
> I0114 18:51:34.722584  4736 master.cpp:1219] The newly elected leader is 
> master@192.168.122.135:57018 with id 20150114-185134-2272962752-57018-4720
> I0114 18:51:34.722865  4736 master.cpp:1232] Elected as the leading master!
> I0114 18:51:34.723310  4736 master.cpp:1050] Recovering from registrar
> I0114 18:51:34.723760  4734 registrar.cpp:313] Recovering registrar
> I0114 18:51:34.725229  4740 log.cpp:660] Attempting to start the writer
> I0114 18:51:34.727893  4739 replica.cpp:477] Replica received implicit 
> promise request with proposal 1
> I0114 18:51:34.728425  4739 leveldb.cpp:306] Persisting metadata (8 bytes) to 
> leveldb took 114781ns
> I0114 18:51:34.728662  4739 replica.cpp:345] Persisted promised to 1
> I0114 18:51:34.731271  4741 coordinator.cpp:230] Coordinator attemping to 
> fill missing position
> I0114 18:51:34.733223  4734 replica.cpp:378] Replica received explicit 
> promise request for position 0 with proposal 2
> I0114 18:51:34.734076  4734 leveldb.cpp:343] Persisting action (8 bytes) to 
> leveldb took 87441ns
> I0114 18:51:34.734441  4734 replica.cpp:679] Persisted action at 0
> I0114 18:51:34.740272  4739 replica.cpp:511] Replica received write request 
> for position 0
> I0114 18:51:34.740910  4739 leveldb.cpp:438] Reading position from leveldb 
> took 59846ns
> I0114 18:51:34.741672  4739 leveldb.cpp:343] Persisting action (14 bytes) to 
> 

[jira] [Commented] (MESOS-2728) Introduce concept of cluster wide resources.

2015-09-18 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2728?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875978#comment-14875978
 ] 

Niklas Quarfot Nielsen commented on MESOS-2728:
---

Let's form a working group :)

> Introduce concept of cluster wide resources.
> 
>
> Key: MESOS-2728
> URL: https://issues.apache.org/jira/browse/MESOS-2728
> Project: Mesos
>  Issue Type: Epic
>Reporter: Joerg Schad
>Assignee: Klaus Ma
>  Labels: mesosphere
>
> There are resources which are not provided by a single node. Consider, for 
> example, the external network bandwidth of a cluster. Being a limited resource, 
> it makes sense for Mesos to manage it, but it is still not a resource offered 
> by a single node. A cluster-wide resource is still consumed by a 
> task, and when that task completes, the resources are then available to be 
> allocated to another framework/task.
> Use Cases:
> 1. Network Bandwidth
> 2. IP Addresses
> 3. Global Service Ports
> 4. Distributed File System Storage
> 5. Software Licences





[jira] [Commented] (MESOS-2930) Allow the Resource Estimator to express over-allocation of revocable resources.

2015-09-18 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2930?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14875997#comment-14875997
 ] 

Niklas Quarfot Nielsen commented on MESOS-2930:
---

Let's chat about this before we move forward; I have concerns about overlapping 
responsibilities of the QoS controller and estimator in this case.

> Allow the Resource Estimator to express over-allocation of revocable 
> resources.
> ---
>
> Key: MESOS-2930
> URL: https://issues.apache.org/jira/browse/MESOS-2930
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Benjamin Mahler
>Assignee: Klaus Ma
>
> Currently, the resource estimator returns the amount of oversubscription 
> resources that are available. Since resources cannot be negative, the 
> resource estimator can only express the following:
> (1) Return empty resources: We are fully allocated for oversubscription 
> resources.
> (2) Return non-empty resources: We are under-allocated for oversubscription 
> resources. In other words, some are available.
> However, there is an additional situation that we cannot express:
> (3) Analogous to returning non-empty "negative" resources: We are 
> over-allocated for oversubscription resources. Do not re-offer any of the 
> over-allocated oversubscription resources that are recovered.
> Without (3), the slave can only shrink the total pool of oversubscription 
> resources by returning (1) as resources are recovered, until the pool is 
> shrunk to the desired size. However, this approach is only best-effort; it is 
> possible for a framework to launch more tasks within the window of time (15 
> seconds by default) at which the slave polls the estimator.
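
The missing case (3) could, for instance, be modeled by pairing the available 
slack with an explicit over-allocation amount. The sketch below is hypothetical 
({{OversubscriptionEstimate}} and {{reofferable}} are illustrative names, not 
part of the Mesos API):

```cpp
#include <cassert>

// Hypothetical sketch (illustrative names, not the Mesos API): an
// estimate that carries both available slack (cases 1 and 2) and an
// explicit over-allocated amount (the missing case 3).
struct OversubscriptionEstimate {
  double available;      // Slack to offer; 0 means fully allocated.
  double overAllocated;  // Amount to reclaim before re-offering.
};

// Withhold recovered revocable resources until the over-allocation is
// paid down, instead of relying on best-effort shrinking of the pool.
double reofferable(const OversubscriptionEstimate& estimate, double recovered) {
  double withheld =
      estimate.overAllocated < recovered ? estimate.overAllocated : recovered;
  return recovered - withheld;
}
```

With this shape, recovered resources are not re-offered while the estimator 
reports an outstanding over-allocation.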





[jira] [Commented] (MESOS-3366) Allow resources/attributes discovery

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791418#comment-14791418
 ] 

Niklas Quarfot Nielsen commented on MESOS-3366:
---

Sorry about the tardy reply; I think we are good. Let's keep the two decorators 
split for now.

> Allow resources/attributes discovery
> 
>
> Key: MESOS-3366
> URL: https://issues.apache.org/jira/browse/MESOS-3366
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Felix Abecassis
>
> In heterogeneous clusters, tasks sometimes have strong constraints on the 
> type of hardware they need to execute on. The current solution is to use 
> custom resources and attributes on the agents. Detecting non-standard 
> resources/attributes requires wrapping the "mesos-slave" binary behind a 
> script and using custom code to probe the agent. Unfortunately, this approach 
> doesn't allow composition. The solution would be to provide a hook/module 
> mechanism to allow users to use custom code performing resources/attributes 
> discovery.
> Please review the detailed document below:
> https://docs.google.com/document/d/15OkebDezFxzeyLsyQoU0upB0eoVECAlzEkeg0HQAX9w
> Feel free to express comments/concerns by annotating the document or by 
> replying to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3418) Factor out V1 API test helper functions

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3418:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> Factor out V1 API test helper functions
> ---
>
> Key: MESOS-3418
> URL: https://issues.apache.org/jira/browse/MESOS-3418
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joris Van Remoortere
>Assignee: Guangya Liu
>  Labels: beginner, mesosphere, newbie, v1_api
>
> We currently have some helper functionality for V1 API tests. This is copied 
> in a few test files.
> Factor this out into a common place once the API is stabilized.
> {code}
> // Helper class for using EXPECT_CALL since the Mesos scheduler API
>   // is callback based.
>   class Callbacks
>   {
>   public:
> MOCK_METHOD0(connected, void(void));
> MOCK_METHOD0(disconnected, void(void));
> MOCK_METHOD1(received, void(const std::queue<Event>&));
>   };
> {code}
> {code}
> // Enqueues all received events into a libprocess queue.
> // TODO(jmlvanre): Factor this common code out of tests into V1
> // helper.
> ACTION_P(Enqueue, queue)
> {
> >   std::queue<Event> events = arg0;
>   while (!events.empty()) {
> // Note that we currently drop HEARTBEATs because most of these tests
> // are not designed to deal with heartbeats.
> // TODO(vinod): Implement DROP_HTTP_CALLS that can filter heartbeats.
> if (events.front().type() == Event::HEARTBEAT) {
>   VLOG(1) << "Ignoring HEARTBEAT event";
> } else {
>   queue->put(events.front());
> }
> events.pop();
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3063) Add an example framework using dynamic reservation

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3063?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791315#comment-14791315
 ] 

Niklas Quarfot Nielsen commented on MESOS-3063:
---

Is it likely that we will have this landed by Monday? If not, let's postpone 
the target release to Mesos 0.26.0.

> Add an example framework using dynamic reservation
> --
>
> Key: MESOS-3063
> URL: https://issues.apache.org/jira/browse/MESOS-3063
> Project: Mesos
>  Issue Type: Task
>Reporter: Michael Park
>Assignee: Klaus Ma
>
> An example framework using dynamic reservation should be added to
> # test dynamic reservations further, and
> # to be used as a reference for those who want to use the dynamic reservation 
> feature.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3371) Implement process::subprocess on Windows

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3371:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> Implement process::subprocess on Windows
> 
>
> Key: MESOS-3371
> URL: https://issues.apache.org/jira/browse/MESOS-3371
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Alex Clemmer
>  Labels: libprocess, mesosphere
>
> From a discussion with mpark, we (IIRC) concluded that even on Windows we call 
> this a couple of times. We need to (1) confirm, and (2) do it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3346) Add filter support for inverse offers

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791313#comment-14791313
 ] 

Niklas Quarfot Nielsen commented on MESOS-3346:
---

Is this reviewable?

> Add filter support for inverse offers
> -
>
> Key: MESOS-3346
> URL: https://issues.apache.org/jira/browse/MESOS-3346
> Project: Mesos
>  Issue Type: Task
>Reporter: Artem Harutyunyan
>Assignee: Artem Harutyunyan
>  Labels: mesosphere
>
> A filter attached to the inverse offer can be used by the framework to 
> control when it wants to be contacted again with the inverse offer, since 
> future circumstances may change the viability of the maintenance schedule.  
> The “filter” for InverseOffers is identical to the existing mechanism for 
> re-offering Offers to frameworks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3412) Revisit Unix-to-Windows permissions mapping

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3412:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> Revisit Unix-to-Windows permissions mapping
> ---
>
> Key: MESOS-3412
> URL: https://issues.apache.org/jira/browse/MESOS-3412
> Project: Mesos
>  Issue Type: Task
>  Components: stout
>Reporter: Alex Clemmer
>Assignee: Alex Clemmer
>  Labels: mesosphere, permissions, stout, windows
>
> In review https://reviews.apache.org/r/37032/ there was some debate about how 
> to handle "fallback" of setting/getting Unix permissions on Windows. That is, 
> on Windows there is not native support for "group" or "other" permissions, so 
> when a user gets/sets group permissions, we can either (1) make that 
> operation a no-op, or (2) "fall back" to getting/setting "user" permissions.
> Originally the review opted for a "strictness" flag, so that at compile time 
> users could pass in a flag and change the "fallback" behavior to be "strict" 
> instead, ignoring group and other permissions setting.
> Currently (Sept 10 2015) we have pulled this option out, and only allow 
> "strict" permissions. This will probably break stuff later, but we can 
> reevaluate then.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2560) Remove RunTaskMessage.framework_id

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791310#comment-14791310
 ] 

Niklas Quarfot Nielsen commented on MESOS-2560:
---

[~karya] Are you working on this? If not, let's postpone to 0.26.0, as we are 
preparing to tag the release candidate for 0.25.0.

> Remove RunTaskMessage.framework_id
> --
>
> Key: MESOS-2560
> URL: https://issues.apache.org/jira/browse/MESOS-2560
> Project: Mesos
>  Issue Type: Task
>  Components: framework
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> The previous release doesn't use framework_id and so it can be safely removed.
> This should land only after https://issues.apache.org/jira/browse/MESOS-2559 
> has been shipped.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-2224) Add explanatory comments for Allocator interface

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2224?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791326#comment-14791326
 ] 

Niklas Quarfot Nielsen edited comment on MESOS-2224 at 9/16/15 11:49 PM:
-

[~alexr] Is it likely that we will have this landed by Monday? If not, let's 
postpone the target release to Mesos 0.26.0.


was (Author: nnielsen):
Is it likely that we will have this landed by Monday? If not, let's postpone 
the target release to Mesos 0.26.0.

> Add explanatory comments for Allocator interface
> 
>
> Key: MESOS-2224
> URL: https://issues.apache.org/jira/browse/MESOS-2224
> Project: Mesos
>  Issue Type: Task
>  Components: allocation
>Affects Versions: 0.25.0
>Reporter: Alexander Rukletsov
>Assignee: Guangya Liu
>Priority: Minor
> Fix For: 0.25.0
>
>
> Allocator is the public API and it would be great to have comments on all 
> calls to be implemented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3037) Add a QUIESCE call to the scheduler

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791323#comment-14791323
 ] 

Niklas Quarfot Nielsen commented on MESOS-3037:
---

Is it likely that we will have this landed by Monday? If not, let's postpone 
the target release to Mesos 0.26.0.

> Add a QUIESCE call to the scheduler
> ---
>
> Key: MESOS-3037
> URL: https://issues.apache.org/jira/browse/MESOS-3037
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.25.0
>Reporter: Vinod Kone
>Assignee: Guangya Liu
>  Labels: September23th
> Fix For: 0.25.0
>
>
> SUPPRESS call is the complement to the current REVIVE call i.e., it will 
> inform Mesos to stop sending offers to the framework. 
> For the scheduler driver to send only Call messages (MESOS-2913), 
> DeactivateFrameworkMessage needs to be converted to Call(s). We can implement 
> this by having the driver send a SUPPRESS call followed by a DECLINE call for 
> outstanding offers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3387) Refactor MesosContainerizer to accept namespace dynamically

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791409#comment-14791409
 ] 

Niklas Quarfot Nielsen commented on MESOS-3387:
---

And

commit 6923bb3e8cfbddde9fbabc6ca4edc29d9fc96c06
Author: Kapil Arya 
Date:   Wed Sep 16 17:01:16 2015 -0700

Updated Isolator::prepare to return list of required namespaces.

This allows the Isolators to decide whether or not to provide resource
isolation on a per-container level.

Review: https://reviews.apache.org/r/38365


> Refactor MesosContainerizer to accept namespace dynamically
> ---
>
> Key: MESOS-3387
> URL: https://issues.apache.org/jira/browse/MESOS-3387
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>
> We use ContainerPrepareInfo to return a list of namespaces required for the 
> particular container (as decided by the isolator). The isolator makes this 
> decision by looking at the incoming ExecutorInfo, etc. parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3420) Resolve shutdown semantics for Machine/Down

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3420?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791306#comment-14791306
 ] 

Niklas Quarfot Nielsen commented on MESOS-3420:
---

[~klausma1982] Are you working on this? If not, let's postpone to 0.26.0, as we 
are preparing to tag the release candidate for 0.25.0.

> Resolve shutdown semantics for Machine/Down
> ---
>
> Key: MESOS-3420
> URL: https://issues.apache.org/jira/browse/MESOS-3420
> Project: Mesos
>  Issue Type: Task
>Reporter: Joris Van Remoortere
>Assignee: Klaus Ma
>  Labels: maintenance, mesosphere
>
> When an operator uses the {{machine/down}} endpoint, the master sends a 
> shutdown message to the agent.
> We need to discuss and resolve the semantics that we want regarding the 
> operators and frameworks knowing when their tasks are terminated.
> One option is to explicitly remove the agent from the master which will send 
> the {{TASK_LOST}} updates and {{SlaveLostMessage}} directly from the master. 
> The concern around this is that during a network partition, or if the agent 
> was down at the time, that these tasks could still be running.
> This is a general problem related to task life-times being dissociated with 
> that life-time of the agent.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3041) Decline call does not include an optional "reason", in the Event/Call API

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3041?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3041:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> Decline call does not include an optional "reason", in the Event/Call API
> -
>
> Key: MESOS-3041
> URL: https://issues.apache.org/jira/browse/MESOS-3041
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Joseph Wu
>Assignee: Joris Van Remoortere
>  Labels: mesosphere
>
> In the Event/Call API, the Decline call is currently used by frameworks to 
> reject resource offers.
> In the case of InverseOffers, the framework could give additional information 
> to the operators and/or allocator, as to why the InverseOffer is declined. 
> i.e. Suppose a cluster running some consensus algorithm is given an 
> InverseOffer on one of its nodes.  It may decline saying "Too few nodes" (or, 
> more verbosely, "Specified InverseOffer would lower the number of active 
> nodes below quorum").
> This change requires the following changes:
> * include/mesos/scheduler/scheduler.proto:
> {code}
> message Call {
>   ...
>   message Decline {
> repeated OfferID offer_ids = 1;
> optional Filters filters = 2;
> // Add this extra string for each OfferID
> // i.e. reasons[i] is for offer_ids[i]
> repeated string reasons = 3;
>   }
>   ...
> }
> {code}
> * src/master/master.cpp
> Change Master::decline to either store the reason, or log it.
> * Add a declineOffer overload in the (Mesos)SchedulerDriver with an optional 
> "reason".
> ** Extend the interface in include/mesos/scheduler.hpp
> ** Add/change the declineOffer method in src/sched/sched.cpp



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2063) Add InverseOffer to C++ Scheduler API.

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2063?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2063:
--
Target Version/s:   (was: 0.25.0)

> Add InverseOffer to C++ Scheduler API.
> --
>
> Key: MESOS-2063
> URL: https://issues.apache.org/jira/browse/MESOS-2063
> Project: Mesos
>  Issue Type: Task
>  Components: c++ api
>Reporter: Benjamin Mahler
>Assignee: Qian Zhang
>  Labels: mesosphere, twitter
>
> The initial use case for InverseOffer in the framework API will be the 
> maintenance primitives in mesos: MESOS-1474.
> One way to add these to the C++ Scheduler API is to add a new callback:
> {code}
>   virtual void inverseResourceOffers(
>   SchedulerDriver* driver,
>   const std::vector<InverseOffer>& inverseOffers) = 0;
> {code}
> libmesos compatibility will need to be figured out here.
> We may want to leave the C++ binding untouched in favor of Event/Call, in 
> order to not break API compatibility for schedulers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3387) Refactor MesosContainerizer to accept namespace dynamically

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791412#comment-14791412
 ] 

Niklas Quarfot Nielsen commented on MESOS-3387:
---

and

commit fd0a431c340d66f96f71715ace0d10d9c1b17b49
Author: Kapil Arya 
Date:   Wed Sep 16 17:01:54 2015 -0700

Added helper to model Labels message for state.json.

Also updated Task modelling to show labels only if Task.has_labels() is
true. This way, the "labels" field won't be shown if there are no labels.
This makes it consistent with the modelling of the rest of the "optional"
fields.

Review: https://reviews.apache.org/r/38366

> Refactor MesosContainerizer to accept namespace dynamically
> ---
>
> Key: MESOS-3387
> URL: https://issues.apache.org/jira/browse/MESOS-3387
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>
> We use ContainerPrepareInfo to return a list of namespaces required for the 
> particular container (as decided by the isolator). The isolator makes this 
> decision by looking at the incoming ExecutorInfo, etc. parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3035) As a Developer I would like a standard way to run a Subprocess in libprocess

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-3035:
--
Target Version/s: 0.26.0  (was: 0.25.0)

> As a Developer I would like a standard way to run a Subprocess in libprocess
> 
>
> Key: MESOS-3035
> URL: https://issues.apache.org/jira/browse/MESOS-3035
> Project: Mesos
>  Issue Type: Story
>  Components: libprocess
>Reporter: Marco Massenzio
>Assignee: Marco Massenzio
>
> As part of MESOS-2830 and MESOS-2902 I have been researching the ability to 
> run a {{Subprocess}} and capture the {{stdout / stderr}} along with the exit 
> status code.
> {{process::subprocess()}} offers much of the functionality, but in a way that 
> still requires a lot of handiwork on the developer's part; we would like to 
> further abstract away the ability to just pass a string, an optional set of 
> command-line arguments and then collect the output of the command (bonus: 
> without blocking).
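A minimal blocking sketch of the helper this story asks for, assuming POSIX `popen()`/`pclose()`; `runCommand` is a hypothetical name, and the real libprocess abstraction would return a `Future` rather than block:

```cpp
#include <cstdio>
#include <string>
#include <utility>

// Hypothetical helper: run a shell command and collect its stdout along
// with the raw exit status from pclose(). A blocking stand-in for the
// non-blocking abstraction described above.
std::pair<std::string, int> runCommand(const std::string& command)
{
  std::string output;

  FILE* pipe = popen(command.c_str(), "r");
  if (pipe == nullptr) {
    return {output, -1};
  }

  char buffer[256];
  while (fgets(buffer, sizeof(buffer), pipe) != nullptr) {
    output += buffer;
  }

  // pclose() returns the command's wait status (0 on clean success).
  return {output, pclose(pipe)};
}
```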



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3387) Refactor MesosContainerizer to accept namespace dynamically

2015-09-16 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14791407#comment-14791407
 ] 

Niklas Quarfot Nielsen commented on MESOS-3387:
---

And

commit fc541a9a97eb1d86c27452019ff217eed11ed5a3
Author: Kapil Arya 
Date:   Wed Sep 16 17:00:55 2015 -0700

Minor refactor for MesosContainerizer.

This change moves two different pieces of code, each of which iterates over a
list of ContainerPrepareInfos, closer together for readability. It
becomes even more relevant when looking at
https://reviews.apache.org/r/38365/ that iterates over yet another
member from ContainerPrepareInfo.

Review: https://reviews.apache.org/r/38364


> Refactor MesosContainerizer to accept namespace dynamically
> ---
>
> Key: MESOS-3387
> URL: https://issues.apache.org/jira/browse/MESOS-3387
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>
> We use ContainerPrepareInfo to return a list of namespaces required for the 
> particular container (as decided by the isolator). The isolator makes this 
> decision by looking at the incoming ExecutorInfo, etc. parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3358) Add TaskStatus label decorator hooks for Master

2015-09-14 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744202#comment-14744202
 ] 

Niklas Quarfot Nielsen commented on MESOS-3358:
---

[~karya] Ping ^^ :)

> Add TaskStatus label decorator hooks for Master
> ---
>
> Key: MESOS-3358
> URL: https://issues.apache.org/jira/browse/MESOS-3358
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> The hook will be triggered when Master receives TaskStatus message from Agent 
> or when the Master itself generates a TASK_LOST status. The hook should also 
> provide a list of the previous TaskStatuses to the module.
> The use case is to allow a "cleanup" module to release IPs if an agent is 
> lost. The previous statuses will contain the IP address(es) to be released.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3387) Refactor MesosContainerizer to accept namespace dynamically

2015-09-14 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3387?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14744321#comment-14744321
 ] 

Niklas Quarfot Nielsen commented on MESOS-3387:
---

[~karya] In the spirit of process, can you please find and assign shepherds to 
these changes so we can agree on direction before code is shared?

> Refactor MesosContainerizer to accept namespace dynamically
> ---
>
> Key: MESOS-3387
> URL: https://issues.apache.org/jira/browse/MESOS-3387
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>
> We use ContainerPrepareInfo to return a list of namespaces required for the 
> particular container (as decided by the isolator). The isolator makes this 
> decision by looking at the incoming ExecutorInfo, etc. parameters.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3361) Update MesosContainerizer to dynamically pick/enable isolators

2015-09-10 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3361?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739958#comment-14739958
 ] 

Niklas Quarfot Nielsen commented on MESOS-3361:
---

This is not needed for 0.25.0, right?

> Update MesosContainerizer to dynamically pick/enable isolators
> --
>
> Key: MESOS-3361
> URL: https://issues.apache.org/jira/browse/MESOS-3361
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> This would allow the frameworks to opt-in/opt-out of network isolation per 
> container. Thus, one can launch some containers with their own IPs while 
> other containers still share the host IP.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2875) Add containerId to ResourceUsage to enable QoS controller to target a container

2015-09-10 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14739474#comment-14739474
 ] 

Niklas Quarfot Nielsen commented on MESOS-2875:
---

Yup - will take a look later today or start tomorrow

> Add containerId to ResourceUsage to enable QoS controller to target a 
> container
> ---
>
> Key: MESOS-2875
> URL: https://issues.apache.org/jira/browse/MESOS-2875
> Project: Mesos
>  Issue Type: Improvement
>  Components: oversubscription, slave
>Affects Versions: 0.25.0
>Reporter: Niklas Quarfot Nielsen
>Assignee: Klaus Ma
>  Labels: race-condition, slave
>
> We should ensure that we are addressing the _container_ which the QoS 
> controller intended to kill. Without this check, we may run into a scenario 
> where the executor has terminated and one with the same id has started in the 
> interim i.e. running in a different container than the one the QoS controller 
> targeted.
> This most likely requires us to add containerId to the ResourceUsage message 
> and encode the containerID in the QoS Correction message.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3401) Add labels to Resources

2015-09-09 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3401?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737839#comment-14737839
 ] 

Niklas Quarfot Nielsen commented on MESOS-3401:
---

What is the use case? Can't the set type represent this?

> Add labels to Resources
> ---
>
> Key: MESOS-3401
> URL: https://issues.apache.org/jira/browse/MESOS-3401
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Adam B
>Assignee: Greg Mann
>  Labels: mesosphere, resources
>
> Similar to how we have added labels to tasks/executors (MESOS-2120), and even 
> FrameworkInfo (MESOS-2841), we should extend Resource to allow arbitrary 
> key/value pairs.
> This could be used to specify that a cpu resource has a certain speed, that a 
> disk resource is SSD, or express any other metadata about a built-in or 
> custom resource type. Only the scalar quantity will be used for determining 
> fair share in the Mesos allocator. The rest will be passed on to frameworks as 
> info they can use for scheduling decisions.
> This would require changes to how the slave specifies its `--resources` 
> (probably as json), how the slave/master reports resources in its web/json 
> API, and how resources are offered to frameworks.
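As a hypothetical illustration of the proposed JSON form of `--resources` (the exact schema was still to be designed at this point), an SSD disk resource carrying a label might look like:

```json
[
  {
    "name": "disk",
    "type": "SCALAR",
    "scalar": {"value": 10240},
    "labels": {
      "labels": [
        {"key": "medium", "value": "ssd"}
      ]
    }
  }
]
```

The scalar value would still be the only input to fair-share accounting; the labels would simply ride along into offers.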



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Comment Edited] (MESOS-3366) Allow resources/attributes discovery

2015-09-09 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737366#comment-14737366
 ] 

Niklas Quarfot Nielsen edited comment on MESOS-3366 at 9/9/15 10:19 PM:


Folks; I think this discussion has drifted a bit. Whether or not modules are 
stable because they are executed in the context of the master and slave is 
outside the scope of this thread.

[~klaus1982] - nothing prevents you from continuing to use scripts to set 
attributes. We need a programmatic way of doing this, and if you don't have 
modules that need this, you don't have to use it.

The focus of this ticket is module development and extensibility. To make a 
decision so we can move forward, we are discussing _how_ we want this interface 
to look: either through an extension to the isolator or through another 
decorator. Let's start exploring what these would look like.

Having one flag per extension does not sound very scalable to me, compared to 
having one interaction point which lets isolators or hooks extend the initial 
resources and attributes for the slave.


was (Author: nnielsen):
Folks; I think this discussion has drifted a bit. Whether or not modules are 
stable because they are executed in the context of the master and slave is 
outside the scope of this thread.

[~klaus1982] - nothing prevents you from continuing to use scripts to set 
attributes. We need a programmatic way of doing this, and if you don't have 
modules that need this, you don't have to use it.

The focus of this ticket is module development and extensibility. To make a 
decision so we can move forward, we are discussing _how_ we want this interface 
to look: either through an extension to the isolator or through another 
decorator. Let's start exploring what these would look like.

Having one flag per extension sounds very scalable to me, compared to having one 
interaction point which lets isolators or hooks extend the initial resources 
and attributes for the slave.

> Allow resources/attributes discovery
> 
>
> Key: MESOS-3366
> URL: https://issues.apache.org/jira/browse/MESOS-3366
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Felix Abecassis
>
> In heterogeneous clusters, tasks sometimes have strong constraints on the 
> type of hardware they need to execute on. The current solution is to use 
> custom resources and attributes on the agents. Detecting non-standard 
> resources/attributes requires wrapping the "mesos-slave" binary behind a 
> script and using custom code to probe the agent. Unfortunately, this approach 
> doesn't allow composition. The solution would be to provide a hook/module 
> mechanism to allow users to use custom code performing resources/attributes 
> discovery.
> Please review the detailed document below:
> https://docs.google.com/document/d/15OkebDezFxzeyLsyQoU0upB0eoVECAlzEkeg0HQAX9w
> Feel free to express comments/concerns by annotating the document or by 
> replying to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3366) Allow resources/attributes discovery

2015-09-09 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737366#comment-14737366
 ] 

Niklas Quarfot Nielsen commented on MESOS-3366:
---

Folks; I think this discussion has drifted a bit. Whether or not modules are 
stable because they are executed in the context of the master and slave is 
outside the scope of this thread.

[~klaus1982] - nothing prevents you from continuing to use scripts to set 
attributes. We need a programmatic way of doing this, and if you don't have 
modules that need this, you don't have to use it.

The focus of this ticket is module development and extensibility. To make a 
decision so we can move forward, we are discussing _how_ we want this interface 
to look: either through an extension to the isolator or through another 
decorator. Let's start exploring what these would look like.

Having one flag per extension sounds very scalable to me, compared to having one 
interaction point which lets isolators or hooks extend the initial resources 
and attributes for the slave.

> Allow resources/attributes discovery
> 
>
> Key: MESOS-3366
> URL: https://issues.apache.org/jira/browse/MESOS-3366
> Project: Mesos
>  Issue Type: Improvement
>  Components: slave
>Reporter: Felix Abecassis
>
> In heterogeneous clusters, tasks sometimes have strong constraints on the 
> type of hardware they need to execute on. The current solution is to use 
> custom resources and attributes on the agents. Detecting non-standard 
> resources/attributes requires wrapping the "mesos-slave" binary behind a 
> script and using custom code to probe the agent. Unfortunately, this approach 
> doesn't allow composition. The solution would be to provide a hook/module 
> mechanism to allow users to use custom code performing resources/attributes 
> discovery.
> Please review the detailed document below:
> https://docs.google.com/document/d/15OkebDezFxzeyLsyQoU0upB0eoVECAlzEkeg0HQAX9w
> Feel free to express comments/concerns by annotating the document or by 
> replying to this issue.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2924) Allow simple construction via initializer list on hashset.

2015-09-09 Thread Niklas Quarfot Nielsen (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2924?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-2924:
--
Fix Version/s: 0.25.0  (was: 0.24.0)

> Allow simple construction via initializer list on hashset.
> --
>
> Key: MESOS-2924
> URL: https://issues.apache.org/jira/browse/MESOS-2924
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Till Toenshoff
>Assignee: Alexander Rojas
>Priority: Minor
>  Labels: mesosphere
> Fix For: 0.25.0
>
>
> {{hashmap}} already has a initializer-list constructor, {{hashset}} should 
> offer the same.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-2844) Add and document new labels field to framework info

2015-09-09 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14737627#comment-14737627
 ] 

Niklas Quarfot Nielsen commented on MESOS-2844:
---

Looks like it - thanks :)

> Add and document new labels field to framework info
> ---
>
> Key: MESOS-2844
> URL: https://issues.apache.org/jira/browse/MESOS-2844
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>
> Add and document new labels field to framework info:
> {code}
> message FrameworkInfo {
>   // Used to determine the Unix user that an executor or task should
>   // be launched as. If the user field is set to an empty string Mesos
>   // will automagically set it to the current user.
>   required string user = 1;
>   // Name of the framework that shows up in the Mesos Web UI.
>   required string name = 2;
>   // Note that 'id' is only available after a framework has
>   // registered, however, it is included here in order to facilitate
>   // scheduler failover (i.e., if it is set then the
>   // MesosSchedulerDriver expects the scheduler is performing
>   // failover).
>   optional FrameworkID id = 3;
>   ...
>   // This field allows a framework to advertise its set of
>   // capabilities (e.g., ability to receive offers for revocable
>   // resources).
>   repeated Capability capabilities = 10;
>   optional Labels labels = 11;
> }
> {code}





[jira] [Commented] (MESOS-2875) Add containerId to ResourceUsage to enable QoS controller to target a container

2015-09-08 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2875?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14735862#comment-14735862
 ] 

Niklas Quarfot Nielsen commented on MESOS-2875:
---

1) Framework launches task X for executor Y
2) QoS controller decides that Y should be killed due to interference
3) Executor Y stops naturally
4) Framework starts a new task X' for executor Y
5) QoS controller sends correction for executor Y

> Add containerId to ResourceUsage to enable QoS controller to target a 
> container
> ---
>
> Key: MESOS-2875
> URL: https://issues.apache.org/jira/browse/MESOS-2875
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Niklas Quarfot Nielsen
>
> We should ensure that we are addressing the _container_ which the QoS 
> controller intended to kill. Without this check, we may run into a scenario 
> where the executor has terminated and one with the same id has started in the 
> interim, i.e., one running in a different container than the one the QoS 
> controller targeted.
> This most likely requires us to add containerId to the ResourceUsage message 
> and encode the containerID in the QoS Correction message.
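
The race in the five steps above can be sketched as follows. This is a 
hypothetical illustration, not Mesos code: the `Slave` class, `Correction` 
struct, and string-typed ids are invented here to show why keying a QoS 
correction on (executorId, containerId) rather than executorId alone makes 
a stale correction a no-op:

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical sketch: a correction carries both the executor id and the
// container id it was issued against.
struct Correction
{
  std::string executorId;
  std::string containerId;
};

class Slave
{
public:
  // Record which container is currently running under an executor id.
  void track(const std::string& executorId, const std::string& containerId)
  {
    running[executorId] = containerId;
  }

  // A correction is acted on only if the targeted container is still the
  // one running under that executor id; otherwise it is dropped.
  bool shouldKill(const Correction& correction) const
  {
    auto it = running.find(correction.executorId);
    return it != running.end() && it->second == correction.containerId;
  }

private:
  std::map<std::string, std::string> running;
};
```

Without the containerId check, the correction in step 5 would kill the new 
task X' started in step 4, which the QoS controller never targeted.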





[jira] [Commented] (MESOS-3358) Add TaskStatus label decorator hooks for Master

2015-09-04 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731125#comment-14731125
 ] 

Niklas Quarfot Nielsen commented on MESOS-3358:
---

Does it need to be a decorator? As long as the hook is installed in all 
masters, it should be ok (given that only one master needs to detect the 
terminal status).

> Add TaskStatus label decorator hooks for Master
> ---
>
> Key: MESOS-3358
> URL: https://issues.apache.org/jira/browse/MESOS-3358
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>
> The hook will be triggered when Master receives TaskStatus message from Agent 
> or when the Master itself generates a TASK_LOST status. The hook should also 
> provide a list of the previous TaskStatuses to the module.
> The use case is to allow a "cleanup" module to release IPs if an agent is 
> lost. The previous statuses will contain the IP address(es) to be released.





[jira] [Created] (MESOS-3369) Harden flag parsing for selecting groups launcher

2015-09-04 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-3369:
-

 Summary: Harden flag parsing for selecting groups launcher
 Key: MESOS-3369
 URL: https://issues.apache.org/jira/browse/MESOS-3369
 Project: Mesos
  Issue Type: Improvement
Reporter: Niklas Quarfot Nielsen


Currently, we just search the isolation flag for containment of 'cgroups'. If a 
modularized isolator contains 'cgroups' in its name, this will influence the 
launcher selection.

https://github.com/apache/mesos/blob/master/src/slave/containerizer/mesos/containerizer.cpp#L246
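
A sketch of the proposed hardening, assuming the hypothetical helper name 
`usesCgroupsIsolation` (the actual selection logic lives in the containerizer 
linked above): split the `--isolation` flag on commas and exact-match tokens, 
instead of doing a substring search over the whole flag value:

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Illustrative only: tokenize the --isolation flag and match tokens
// exactly, so that a modularized isolator whose name merely contains
// "cgroups" (e.g. "my_cgroups_isolator") no longer selects the
// cgroups launcher the way a substring search would.
bool usesCgroupsIsolation(const std::string& isolation)
{
  std::stringstream stream(isolation);
  std::string token;
  while (std::getline(stream, token, ',')) {
    // Built-in cgroups isolators are named "cgroups" or "cgroups/<subsystem>".
    if (token == "cgroups" || token.rfind("cgroups/", 0) == 0) {
      return true;
    }
  }
  return false;
}
```

The key difference from the current behavior is that only whole tokens are 
considered, so module names containing 'cgroups' as a substring do not match.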




