[jira] [Commented] (MESOS-5452) Agent modules should be initialized before all components except firewall.

2016-06-12 Thread Avinash Sridharan (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326865#comment-15326865
 ] 

Avinash Sridharan commented on MESOS-5452:
--

https://reviews.apache.org/r/47892/

> Agent modules should be initialized before all components except firewall.
> --
>
> Key: MESOS-5452
> URL: https://issues.apache.org/jira/browse/MESOS-5452
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> On Mesos Agents, Anonymous modules should not, by design, have any 
> dependencies on any other Mesos components. This implies that Anonymous 
> modules should be initialized before all other Mesos components except 
> `Firewall`. The dependency on `Firewall` exists primarily to enforce any 
> policies that secure endpoints owned by the Anonymous module.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5452) Agent modules should be initialized before all components except firewall.

2016-06-12 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5452?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5452:
-
  Sprint: Mesosphere Sprint 37
Story Points: 1

> Agent modules should be initialized before all components except firewall.
> --
>
> Key: MESOS-5452
> URL: https://issues.apache.org/jira/browse/MESOS-5452
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
> Environment: Linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> On Mesos Agents, Anonymous modules should not, by design, have any 
> dependencies on any other Mesos components. This implies that Anonymous 
> modules should be initialized before all other Mesos components except 
> `Firewall`. The dependency on `Firewall` exists primarily to enforce any 
> policies that secure endpoints owned by the Anonymous module.





[jira] [Updated] (MESOS-4781) Executor env variables should not be leaked to the command task.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4781?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4781:
-
Sprint: Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 32, 
Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere 
Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 30, Mesosphere Sprint 
31, Mesosphere Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, 
Mesosphere Sprint 35, Mesosphere Sprint 36)

> Executor env variables should not be leaked to the command task.
> 
>
> Key: MESOS-4781
> URL: https://issues.apache.org/jira/browse/MESOS-4781
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Gilbert Song
>  Labels: mesosphere
>
> Currently, a command task inherits the env variables of the command executor. 
> This is less than ideal because the command executor's environment variables 
> include some Mesos-internal env variables like MESOS_XXX and LIBPROCESS_XXX. 
> Also, this behavior does not match what the Docker containerizer does. We 
> should construct the env variables from scratch for the command task, rather 
> than relying on inheriting them from the command executor.





[jira] [Updated] (MESOS-5557) Add `NvidiaGpuAllocator` component for cross-containerizer GPU allocation

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5557:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Add `NvidiaGpuAllocator` component for cross-containerizer GPU allocation
> -
>
> Key: MESOS-5557
> URL: https://issues.apache.org/jira/browse/MESOS-5557
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> We need some way of allocating GPUs from a centralized location to allow both 
> the mesos containerizer and the docker containerizer to pull from a central 
> pool. We propose to build an `NvidiaGpuAllocator` for this purpose.
> 
> This component should also be overloaded to do resource enumeration of GPUs 
> based on the agent flags. This keeps all code for enumerating GPUs and the 
> resources they represent in a single centralized location.





[jira] [Updated] (MESOS-4938) Support docker registry authentication

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4938:
-
Sprint: Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33, 
Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere 
Sprint 37  (was: Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 
33, Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 36)

> Support docker registry authentication
> --
>
> Key: MESOS-4938
> URL: https://issues.apache.org/jira/browse/MESOS-4938
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Gilbert Song
>  Labels: mesosphere
>






[jira] [Updated] (MESOS-5216) Document docker volume driver isolator.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5216:
-
Sprint: Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37  
(was: Mesosphere Sprint 35, Mesosphere Sprint 36)

> Document docker volume driver isolator.
> ---
>
> Key: MESOS-5216
> URL: https://issues.apache.org/jira/browse/MESOS-5216
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Guangya Liu
>  Labels: documentation, mesosphere
> Fix For: 1.0.0
>
>
> Should include the following:
> 1. What features (driver options) are supported in the docker volume driver 
> isolator.
> 2. How to use the docker volume driver isolator:
> * related agent flags introduction and usage.
> * isolator dependency clarification (e.g., filesystem/linux).
> * related driver daemon preprocessing.
> * volumes pre-specified by users and volume cleanup.





[jira] [Updated] (MESOS-5550) Remove Nvidia GPU Isolator's link-time dependence on `libnvidia-ml`

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5550:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Remove Nvidia GPU Isolator's link-time dependence on `libnvidia-ml`
> ---
>
> Key: MESOS-5550
> URL: https://issues.apache.org/jira/browse/MESOS-5550
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> The current Nvidia GPU isolator has a dependence on `libnvidia-ml`, and as 
> such pulls a hard dependence on this library into `libmesos`. The 
> consequence is that any process that relies on `libmesos` must have 
> `libnvidia-ml` available as well, even on machines where no GPUs are 
> available. Since this library is not easily installable through standard 
> package managers, such a hard dependence can be burdensome.
> This ticket proposes to pull in `libnvidia-ml` as a run-time dependence 
> instead of a link-time dependence. As such, only machines that actually have 
> GPUs installed and would like to rely on this library need to have it 
> installed.





[jira] [Updated] (MESOS-4233) Logging is too verbose for sysadmins / syslog

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4233:
-
Sprint: Mesosphere Sprint 26, Mesosphere Sprint 27, Mesosphere Sprint 28, 
Mesosphere Sprint 29, Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere 
Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, 
Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 26, 
Mesosphere Sprint 27, Mesosphere Sprint 28, Mesosphere Sprint 29, Mesosphere 
Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 32, Mesosphere Sprint 33, 
Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 36)

> Logging is too verbose for sysadmins / syslog
> -
>
> Key: MESOS-4233
> URL: https://issues.apache.org/jira/browse/MESOS-4233
> Project: Mesos
>  Issue Type: Epic
>Reporter: Cody Maloney
>Assignee: Kapil Arya
>  Labels: mesosphere
> Attachments: giant_port_range_logging
>
>
> Currently mesos logs a lot. When launching a thousand tasks in the space of 
> 10 seconds it will print tens of thousands of log lines, overwhelming syslog 
> (there is a max rate at which a process can send stuff over a unix socket) 
> and not giving useful information to a sysadmin who cares about just the 
> high-level activity and when something goes wrong.
> Note mesos also blocks writing to its log locations, so when writing a lot of 
> log messages, it can fill up the write buffer in the kernel, and be suspended 
> until the syslog agent catches up reading from the socket (GLOG does a 
> blocking fwrite to stderr). GLOG also has a big mutex around logging so only 
> one thing logs at a time.
> While for "internal debugging" it is useful to see things like "message went 
> from internal component x to internal component y", from a sysadmin 
> perspective I only care about the high-level actions taken (launched task for 
> framework x, sent offer to framework y, got task failed from host z). Note 
> those are what I'd expect at the "INFO" level. At the "WARNING" level I'd 
> expect very little to be logged / almost nothing in normal operation. Just 
> things like "WARN: Replicated log write took longer than expected". WARN 
> would also get things like backtraces on crashes and abnormal exits / aborts.
> When trying to launch 3k+ tasks inside a second, mesos logging currently 
> overwhelms syslog with 100k+ messages, many of which are thousands of bytes. 
> Sysadmins expect to be able to use syslog to monitor basic events in their 
> system. This is too much.
> We can keep logging the messages to files, but the logging to stderr needs to 
> be reduced significantly (stderr gets picked up and forwarded to syslog / 
> central aggregation).
> What I would like is if I can set the stderr logging level to be different / 
> independent from the file logging level (Syslog giving the "sysadmin" 
> aggregated overview, files useful for debugging in depth what happened in a 
> cluster). A lot of what mesos currently logs at info is really debugging info 
> / should show up as debug log level.
> Some samples of mesos logging a lot more than a sysadmin would want / expect 
> are attached, and some are below:
>  - Every task gets printed multiple times for a basic launch:
> {noformat}
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382644  1315 master.cpp:3248] Launching task 
> envy.5b19a713-a37f-11e5-8b3e-0251692d6109 of framework 
> 5178f46d-71d6-422f-922c-5bbe82dff9cc- (marathon)
> Dec 15 22:58:30 ip-10-0-7-60.us-west-2.compute.internal mesos-master[1311]: 
> I1215 22:58:29.382925  1315 master.hpp:176] Adding task 
> envy.5b1958f2-a37f-11e5-8b3e-0251692d6109 with resources cpus(*):0.0001; 
> mem(*):16; ports(*):[14047-14047]
> {noformat}
>  - Every task status update prints many log lines; successful ones are part 
> of normal operation and should perhaps be logged at info / debug levels, but 
> not shown to a sysadmin (just show when things fail, and maybe aggregate 
> counters to convey the volume of work).
>  - No log message should be really big / more than 1k characters (this would 
> prevent the giant port list attached, and make that easily discoverable / bug 
> filable / fixable).





[jira] [Updated] (MESOS-4099) parallel make tests does not build all test targets

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4099:
-
Sprint: Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37  
(was: Mesosphere Sprint 35, Mesosphere Sprint 36)

> parallel make tests does not build all test targets
> ---
>
> Key: MESOS-4099
> URL: https://issues.apache.org/jira/browse/MESOS-4099
> Project: Mesos
>  Issue Type: Bug
>  Components: build, libprocess
>Affects Versions: 0.26.0
> Environment: Ubuntu 15.04
> clang-3.6 as well as gcc-4.9
>Reporter: Joris Van Remoortere
>Assignee: Kapil Arya
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> When inside 3rdparty/libprocess:
> Running {{make -j8 tests}} from a clean build does not yield the 
> {{libprocess-tests}} binary.
> Running it a subsequent time triggers more compilation and ends up yielding 
> the {{libprocess-tests}} binary.
> This suggests the {{tests}} target is not being built correctly.





[jira] [Updated] (MESOS-5558) Update `Containerizer::resources()` to use the `NvidiaGpuAllocator`

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5558:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Update `Containerizer::resources()` to use the `NvidiaGpuAllocator`
> ---
>
> Key: MESOS-5558
> URL: https://issues.apache.org/jira/browse/MESOS-5558
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> With the introduction of the shared `NvidiaGpuAllocator` component, 
> `Containerizer::resources()` should be updated to use it.





[jira] [Updated] (MESOS-4690) Reorganize 3rdparty directory

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4690?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4690:
-
Sprint: Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere Sprint 35, 
Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 33, 
Mesosphere Sprint 34, Mesosphere Sprint 35, Mesosphere Sprint 36)

> Reorganize 3rdparty directory
> -
>
> Key: MESOS-4690
> URL: https://issues.apache.org/jira/browse/MESOS-4690
> Project: Mesos
>  Issue Type: Epic
>  Components: build, libprocess, stout
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
>
> This issue is currently being discussed on the dev mailing list:
> http://www.mail-archive.com/dev@mesos.apache.org/msg34349.html





[jira] [Updated] (MESOS-5563) Rearrange Nvidia GPU files to cleanup semantics for header inclusion.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5563:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Rearrange Nvidia GPU files to cleanup semantics for header inclusion.
> -
>
> Key: MESOS-5563
> URL: https://issues.apache.org/jira/browse/MESOS-5563
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> Currently, components outside of 
> `src/slave/containerizers/mesos/isolators/gpu` have to protect their 
> #includes for certain Nvidia header files with the ENABLE_NVIDIA_GPU_SUPPORT 
> flag. Other headers strictly *could not* be wrapped in this flag.
> 
> We need to clean up this header madness by creating a common "nvidia.hpp" 
> header that takes care of all the dependencies. All components outside of 
> `src/slave/containerizers/mesos/isolators/gpu`
> should only need to #include this one header instead of managing everything 
> themselves.





[jira] [Updated] (MESOS-5257) Add autodiscovery for GPU resources

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5257:
-
Sprint: Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37  
(was: Mesosphere Sprint 35, Mesosphere Sprint 36)

> Add autodiscovery for GPU resources
> ---
>
> Key: MESOS-5257
> URL: https://issues.apache.org/jira/browse/MESOS-5257
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: isolator
>
> Right now, the only way to enumerate the available GPUs on an agent is to use 
> the `--nvidia_gpu_devices` flag and explicitly list them out.  Instead, we 
> should leverage NVML to autodiscover the GPUs that are available and only use 
> this flag as a way to explicitly list out the GPUs you want to make available 
> in order to restrict access to some of them.





[jira] [Updated] (MESOS-5556) Fix method of populating device entries for `/dev/nvidia-uvm`, etc.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5556:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Fix method of populating device entries for `/dev/nvidia-uvm`, etc.
> ---
>
> Key: MESOS-5556
> URL: https://issues.apache.org/jira/browse/MESOS-5556
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> Currently, the major/minor numbers of `/dev/nvidiactl` and `/dev/nvidia-uvm` 
> are hard-coded. This causes problems for `/dev/nvidia-uvm` because its major 
> number is part of the "Experimental" device range on Linux.
> Because this range is experimental, there is no guarantee which device
> number will be assigned to it on a given machine. We should use 
> `os::stat::rdev()` to extract the major/minor numbers programmatically.





[jira] [Updated] (MESOS-5554) Change major/minor device types for Nvidia GPUs to `unsigned int`

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5554:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Change major/minor device types for Nvidia GPUs to `unsigned int`
> -
>
> Key: MESOS-5554
> URL: https://issues.apache.org/jira/browse/MESOS-5554
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> Currently, the GPU struct specifies the type of its `major` and `minor` 
> fields as `dev_t`, which is actually a concatenation of both the major and 
> minor device numbers accessible through the `major()` and `minor()` macros. 
> These macros return an `unsigned int` when handed a `dev_t`, so it makes 
> sense for these fields to be of that type instead.





[jira] [Updated] (MESOS-5401) Add ability to inject a Volume of Nvidia GPU-related libraries into a docker container.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5401?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5401:
-
Sprint: Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37  
(was: Mesosphere Sprint 35, Mesosphere Sprint 36)

> Add ability to inject a Volume of Nvidia GPU-related libraries into a docker 
> container.
> ---
>
> Key: MESOS-5401
> URL: https://issues.apache.org/jira/browse/MESOS-5401
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>
> In order to support Nvidia GPUs with docker containers in Mesos, we need to 
> be able to consolidate all Nvidia libraries into a common volume and inject 
> that volume into the container.
> More info on why this is necessary here: 
> https://github.com/NVIDIA/nvidia-docker/





[jira] [Updated] (MESOS-5582) Create a `cgroups/devices` isolator.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5582?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5582:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Create a `cgroups/devices` isolator.
> 
>
> Key: MESOS-5582
> URL: https://issues.apache.org/jira/browse/MESOS-5582
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, isolator, mesosphere
>
> Currently, all the logic for the `cgroups/devices` isolator is bundled into 
> the Nvidia GPU Isolator. We should abstract it out into its own component 
> and remove the redundant logic from the Nvidia GPU Isolator. Assuming the 
> guaranteed ordering between isolators from MESOS-5581, we can be sure that 
> the dependency order between the `cgroups/devices` and `gpu/nvidia` isolators 
> is met.





[jira] [Updated] (MESOS-2043) Framework auth fail with timeout error and never get authenticated

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-2043:
-
Sprint: Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37  
(was: Mesosphere Sprint 35, Mesosphere Sprint 36)

> Framework auth fail with timeout error and never get authenticated
> --
>
> Key: MESOS-2043
> URL: https://issues.apache.org/jira/browse/MESOS-2043
> Project: Mesos
>  Issue Type: Bug
>  Components: master, scheduler driver, security, slave
>Affects Versions: 0.21.0
>Reporter: Bhuvan Arumugam
>Assignee: Benjamin Bannier
>Priority: Critical
>  Labels: mesosphere, security
> Attachments: aurora-scheduler.20141104-1606-1706.log, master.log, 
> mesos-master.20141104-1606-1706.log, slave.log
>
>
> I'm facing this issue in master as of 
> https://github.com/apache/mesos/commit/74ea59e144d131814c66972fb0cc14784d3503d4
> As [~adam-mesos] mentioned in IRC, this sounds similar to MESOS-1866. I'm 
> running 1 master and 1 scheduler (aurora). The framework authentication fails 
> due to a timeout:
> error on mesos master:
> {code}
> I1104 19:37:17.741449  8329 master.cpp:3874] Authenticating 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083
> I1104 19:37:17.741585  8329 master.cpp:3885] Using default CRAM-MD5 
> authenticator
> I1104 19:37:17.742106  8336 authenticator.hpp:169] Creating new server SASL 
> connection
> W1104 19:37:22.742959  8329 master.cpp:3953] Authentication timed out
> W1104 19:37:22.743548  8329 master.cpp:3930] Failed to authenticate 
> scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083: 
> Authentication discarded
> {code}
> scheduler error:
> {code}
> I1104 19:38:57.885486 49012 sched.cpp:283] Authenticating with master 
> master@MASTER_IP:PORT
> I1104 19:38:57.885928 49002 authenticatee.hpp:133] Creating new client SASL 
> connection
> I1104 19:38:57.890581 49007 authenticatee.hpp:224] Received SASL 
> authentication mechanisms: CRAM-MD5
> I1104 19:38:57.890656 49007 authenticatee.hpp:250] Attempting to authenticate 
> with mechanism 'CRAM-MD5'
> W1104 19:39:02.891196 49005 sched.cpp:378] Authentication timed out
> I1104 19:39:02.891850 49018 sched.cpp:338] Failed to authenticate with master 
> master@MASTER_IP:PORT: Authentication discarded
> {code}
> Looks like two instances, {{scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94}} and 
> {{scheduler-d2d4437b-d375-4467-a583-362152fe065a}}, of the same framework are 
> trying to authenticate and failing.
> {code}
> W1104 19:36:30.769420  8319 master.cpp:3930] Failed to authenticate 
> scheduler-20f88a53-5945-4977-b5af-28f6c52d3c94@SCHEDULER_IP:8083: Failed to 
> communicate with authenticatee
> I1104 19:36:42.701441  8328 master.cpp:3860] Queuing up authentication 
> request from scheduler-d2d4437b-d375-4467-a583-362152fe065a@SCHEDULER_IP:8083 
> because authentication is still in progress
> {code}
> Restarting the master and scheduler didn't fix it. 
> This particular issue happens with 1 master and 1 scheduler after MESOS-1866 
> was fixed.





[jira] [Updated] (MESOS-5562) Add class to share Nvidia-specific components between containerizers

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5562:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Add class to share Nvidia-specific components between containerizers
> 
>
> Key: MESOS-5562
> URL: https://issues.apache.org/jira/browse/MESOS-5562
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> Once we have an `NvidiaGPUAllocator` component, we need some way to share it 
> across multiple containerizers.  Moreover, we anticipate needing other Nvidia 
> components to share across multiple containerizers as well (e.g. an 
> `NvidiaVolumeManager` component). As such, we should add a wrapper class 
> around these components to make it easily passable to each containerizer 
> without having to continually add a bunch of parameters to the Containerizer 
> interface.





[jira] [Updated] (MESOS-5552) Bundle NVML headers for Nvidia GPU support.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5552:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Bundle NVML headers for Nvidia GPU support.
> ---
>
> Key: MESOS-5552
> URL: https://issues.apache.org/jira/browse/MESOS-5552
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> Currently, we rely on a script to install the Nvidia GDK as a build 
> dependence for building Mesos with Nvidia GPU support.
> A previous ticket removed the Mesos build dependence on `libnvidia-ml` which 
> comes as part of the GDK. This ticket proposes bundling the NVML headers with 
> Mesos in order to completely remove the build dependence on the GDK.
> With this change it will be much simpler to configure and build with Nvidia 
> GPU support.  All that will be required is:
> {noformat}
> ../configure --enable-nvidia-gpu-support
> make -j
> {noformat}





[jira] [Updated] (MESOS-5559) Integrate the `NvidiaGpuAllocator` into the `NvidiaGpuIsolator`

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5559:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Integrate the `NvidiaGpuAllocator` into the `NvidiaGpuIsolator`
> ---
>
> Key: MESOS-5559
> URL: https://issues.apache.org/jira/browse/MESOS-5559
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>






[jira] [Updated] (MESOS-5551) Move the Nvidia GPU isolator from `cgroups/devices/gpu/nvidia` to `gpu/nvidia`

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5551:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Move the Nvidia GPU isolator from `cgroups/devices/gpu/nvidia` to `gpu/nvidia`
> --
>
> Key: MESOS-5551
> URL: https://issues.apache.org/jira/browse/MESOS-5551
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: gpu, mesosphere
>
> Currently, the Nvidia GPU isolator lives in 
> `src/slave/containerizers/mesos/isolators/cgroups/devices/gpu/nvidia`. 
> However, in the future this isolator will do more than simply isolate GPUs 
> using the cgroups devices subsystem (e.g. volume management for injecting 
> machine specific Nvidia libraries into a container). For this reason, we 
> should preemptively move this isolator up to 
> `src/slave/containerizers/mesos/isolators/gpu/nvidia`. As part of this, we 
> should update the string we pass to the `--isolator` agent flag to reflect 
> this.





[jira] [Updated] (MESOS-5445) Allow libprocess/stout to build without first doing `make` in 3rdparty.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5445:
-
Sprint: Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37  
(was: Mesosphere Sprint 35, Mesosphere Sprint 36)

> Allow libprocess/stout to build without first doing `make` in 3rdparty.
> ---
>
> Key: MESOS-5445
> URL: https://issues.apache.org/jira/browse/MESOS-5445
> Project: Mesos
>  Issue Type: Bug
>  Components: build
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> After the 3rdparty reorg, libprocess/stout are unable to build their 
> dependencies, so one has to run `make` in 3rdparty/ before building 
> libprocess/stout.





[jira] [Updated] (MESOS-4766) Improve allocator performance.

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4766?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-4766:
-
Sprint: Mesosphere Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, 
Mesosphere Sprint 35, Mesosphere Sprint 36, Mesosphere Sprint 37  (was: 
Mesosphere Sprint 32, Mesosphere Sprint 33, Mesosphere Sprint 34, Mesosphere 
Sprint 35, Mesosphere Sprint 36)

> Improve allocator performance.
> --
>
> Key: MESOS-4766
> URL: https://issues.apache.org/jira/browse/MESOS-4766
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Michael Park
>Priority: Critical
>
> This is an epic to track the various tickets around improving the performance 
> of the allocator, including the following:
> * Preventing unnecessary backup of the allocator.
> * Reducing the cost of allocations and allocator state updates.
> * Improving performance of the DRF sorter.
> * More benchmarking to simulate scenarios with performance issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5577) Modules using replicated log state API require zookeeper headers

2016-06-12 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5577?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-5577:
-
Sprint: Mesosphere Sprint 36, Mesosphere Sprint 37  (was: Mesosphere Sprint 
36)

> Modules using replicated log state API require zookeeper headers
> 
>
> Key: MESOS-5577
> URL: https://issues.apache.org/jira/browse/MESOS-5577
> Project: Mesos
>  Issue Type: Bug
>  Components: modules
>Affects Versions: 1.0.0
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
> Fix For: 1.0.0
>
>
> The state API uses zookeeper client headers and hence the bundled zookeeper 
> headers need to be installed during Mesos installation. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4586) Resources clarification in Mesos UI

2016-06-12 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4586?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326574#comment-15326574
 ] 

haosdent commented on MESOS-4586:
-

[~xds2000] I just double-checked the code today. According to the code, let me 
use CPU as an example:

{code}
for each framework {
  $scope.used_cpus += framework.resources.cpus;
}
$scope.used_cpus -= $scope.offered_cpus;
$scope.idle_cpus = $scope.total_cpus - ($scope.offered_cpus + $scope.used_cpus);
{code}

My understanding is:
{code}
total_cpus = idle_cpus + allocated_cpus
allocated_cpus = offered_cpus + used_cpus
{code}

So the webui looks correct. Did you encounter a problem where {{Idle}} shows 
enough resources but tasks get stuck when launched?

> Resources clarification in Mesos UI
> ---
>
> Key: MESOS-4586
> URL: https://issues.apache.org/jira/browse/MESOS-4586
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Craig W
>Assignee: haosdent
>
> On the Mesos UI, under the "resources" section that lists CPUs and Mem, the 
> values seem to be calculated by summing up every executor's CPU and memory 
> statistics, which would be less than or equal to the "allocated" resources.
> On the page that displays information for a slave, the CPUs and Mem columns 
> show both used and allocated.
> When I look at the Mesos UI front page, I was looking at "Idle" resources as 
> the amount of resources I have available for offers. However, that's not the 
> case. It would be nice to have it show the amount of "free" or "available" 
> resources as well as "idle", so I can better determine how many resources I 
> actually have available for scheduling additional tasks.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5513) Implement SET_LOGGING_LEVEL Call in v1 agent API.

2016-06-12 Thread Abhishek Dasgupta (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5513?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Abhishek Dasgupta updated MESOS-5513:
-
Assignee: haosdent  (was: Abhishek Dasgupta)

> Implement SET_LOGGING_LEVEL Call in v1 agent API.
> -
>
> Key: MESOS-5513
> URL: https://issues.apache.org/jira/browse/MESOS-5513
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>Assignee: haosdent
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5604) Check for incorrect use of `.then` (as opposed to `.then(defer(self() ))`).

2016-06-12 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326551#comment-15326551
 ] 

Jie Yu commented on MESOS-5604:
---

commit 9ad4f3fb85892c8a29e08182e1a48dee0e6ada47
Author: Joerg Schad 
Date:   Sun Jun 12 10:28:02 2016 -0700

Fixed continuation logic in docker containerizer.

Previously the continuation followed via `.then([=]`, which
potentially executes the continuation on a different process. This patch
fixes this behavior (it should run on the same process) and avoids
potential race conditions.

Review: https://reviews.apache.org/r/48599/

> Check for incorrect use of `.then` (as opposed to `.then(defer(self() ))`).
> ---
>
> Key: MESOS-5604
> URL: https://issues.apache.org/jira/browse/MESOS-5604
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> We recently experienced a race condition (MESOS-5587) because we used .then() 
> instead of .then(defer(self(), ...)).
> When looking at the code base we found a number of other potentially wrong 
> uses that could cause race conditions. 
> This Jira is supposed to track the investigation effort in this matter.
> Note that this problem does not apply only to .then(), but also, for example, 
> to .onAny(). From looking at the codebase, it just seems we are more 
> disciplined when using .onAny(). Hence we first look at .then().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5604) Check for incorrect use of `.then` (as opposed to `.then(defer(self() ))`).

2016-06-12 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5604?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326479#comment-15326479
 ] 

Joerg Schad commented on MESOS-5604:


Fixed continuation logic in docker.cpp.
https://reviews.apache.org/r/48599/

> Check for incorrect use of `.then` (as opposed to `.then(defer(self() ))`).
> ---
>
> Key: MESOS-5604
> URL: https://issues.apache.org/jira/browse/MESOS-5604
> Project: Mesos
>  Issue Type: Task
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>
> We recently experienced a race condition (MESOS-5587) because we used .then() 
> instead of .then(defer(self(), ...)).
> When looking at the code base we found a number of other potentially wrong 
> uses that could cause race conditions. 
> This Jira is supposed to track the investigation effort in this matter.
> Note that this problem does not apply only to .then(), but also, for example, 
> to .onAny(). From looking at the codebase, it just seems we are more 
> disciplined when using .onAny(). Hence we first look at .then().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5605) Improve documentation for using persistent volumes.

2016-06-12 Thread Joerg Schad (JIRA)
Joerg Schad created MESOS-5605:
--

 Summary: Improve documentation for using persistent volumes. 
 Key: MESOS-5605
 URL: https://issues.apache.org/jira/browse/MESOS-5605
 Project: Mesos
  Issue Type: Documentation
Reporter: Joerg Schad
Assignee: Joerg Schad


When using persistent volumes at ArangoDB we ran into a few pitfalls.
We should document them so that others can avoid these issues.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5377) Improve DRF behavior with scarce resources.

2016-06-12 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5377?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326333#comment-15326333
 ] 

Guangya Liu commented on MESOS-5377:


[~bmahler] I posted a prototype here 
https://github.com/jay-lau/mesos/commit/d411e20350f9c10100314da113f705f00ea55d74

The main idea for this is:
1) Added a new flag named {{allocator_fairness_excluded_resource_names}} to 
define the scarce resources.
2) Added helper functions to filter out scarce resources.
{code}
// Tests if the given Resource object is non-scarce. If
// fairnessExcludeResourceNames is specified, all of the resources in
// fairnessExcludeResourceNames will be treated as scarce resources,
// and those resources will be filtered out.
static bool isNonScarce(
    const Resource& resource,
    const Option& fairnessExcludeResourceNames);

// Returns the non-scarce resources; all of the resources in
// fairnessExcludeResourceNames will be treated as scarce resources,
// and those resources will be filtered out.
Resources nonScarce(const Option&
    fairnessExcludeResourceNames = None()) const;
{code}
3) Filter out the scarce resources in the allocator, so that the sorter is not 
aware of the scarce resources.


> Improve DRF behavior with scarce resources.
> ---
>
> Key: MESOS-5377
> URL: https://issues.apache.org/jira/browse/MESOS-5377
> Project: Mesos
>  Issue Type: Epic
>  Components: allocation
>Reporter: Benjamin Mahler
>Assignee: Guangya Liu
>
> The allocator currently uses the notion of Weighted [Dominant Resource 
> Fairness|https://www.cs.berkeley.edu/~alig/papers/drf.pdf] (WDRF) to 
> establish a linear notion of fairness across allocation roles.
> DRF behaves well for resources that are present within each machine in a 
> cluster (e.g. CPUs, memory, disk). However, some resources (e.g. GPUs) are 
> only present on a subset of machines in the cluster.
> Consider the behavior when there are the following agents in a cluster:
> 1000 agents with (cpus:4,mem:1024,disk:1024)
> 1 agent with (gpus:1,cpus:4,mem:1024,disk:1024)
> If a role wishes to use both GPU and non-GPU resources for tasks, consuming 1 
> GPU will lead DRF to consider the role to have a 100% share of the cluster, 
> since it consumes 100% of the GPUs in the cluster. This framework will then 
> not receive any other offers.
> Among possible improvements, the fairness model could incorporate a notion of 
> resource packages. In a sense there is 1 GPU package being competed over and 
> 1000 non-GPU packages being competed over, and ideally a role's consumption 
> of the single GPU package does not have a large effect on the role's access 
> to the other 1000 non-GPU packages.
> In the interim, we should consider having a recommended way to deal with 
> scarce resources in the current model.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-4973) Duplicates in 'unregistered_frameworks' in /state

2016-06-12 Thread shijinkui (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-4973?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15326228#comment-15326228
 ] 

shijinkui commented on MESOS-4973:
--

Why are there so many 'unregistered_frameworks'? I have the same problem.

 "0c66ac6a-9ced-4bfc-bc35-c0e7b59e3704-0024",
"b65d63c1-50d8-47e7-885a-2b52d89ee6e5-5809",
"0c66ac6a-9ced-4bfc-bc35-c0e7b59e3704-0032",
"4f872d26-8d33-46f0-a495-2dfb6eedae54-0004",
"20b47d34-c45a-4e2b-9cca-6081e16ef3f5-0004",
"e2260bda-d8a4-4849-85cf-bc4c5e074285-0007",
"0c66ac6a-9ced-4bfc-bc35-c0e7b59e3704-0026",
"0c66ac6a-9ced-4bfc-bc35-c0e7b59e3704-0017",
"20b47d34-c45a-4e2b-9cca-6081e16ef3f5-0005",

> Duplicates in 'unregistered_frameworks' in /state 
> --
>
> Key: MESOS-4973
> URL: https://issues.apache.org/jira/browse/MESOS-4973
> Project: Mesos
>  Issue Type: Bug
>Reporter: Yan Xu
>Priority: Minor
>  Labels: mesosphere
>
> In our clusters where many frameworks run, 'unregistered_frameworks' 
> currently doesn't show what it semantically means, but rather "a list of 
> frameworkIDs for each orphaned task", which means a lot of duplicated 
> frameworkIDs.
> For this field to be useful we need to deduplicate the list when outputting it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)