[jira] [Commented] (MESOS-5330) Agent should backoff before connecting to the master

2016-05-09 Thread David Robinson (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277594#comment-15277594
 ] 

David Robinson commented on MESOS-5330:
---

https://reviews.apache.org/r/47154/

> Agent should backoff before connecting to the master
> 
>
> Key: MESOS-5330
> URL: https://issues.apache.org/jira/browse/MESOS-5330
> Project: Mesos
>  Issue Type: Bug
>Reporter: David Robinson
>Assignee: David Robinson
>
> When an agent is started, it starts a background task (libprocess process?) to 
> detect the leading master. When the leading master is detected (or changes), 
> the [SocketManager's link() method is called and a TCP connection to the 
> master is 
> established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954].
> The agent _then_ backs off before sending a ReRegisterSlave message via the 
> newly established connection. The agent needs to back off _before_ attempting 
> to establish a TCP connection to the master, not before sending the first 
> message over the connection.
> During scale tests at Twitter we discovered that agents can SYN flood the 
> master upon leader changes; the problem described in MESOS-5200, where 
> ephemeral connections are used, can then occur, which exacerbates the problem. 
> The end result is a lot of hosts setting up and tearing down TCP connections 
> every slave_ping_timeout seconds (15 by default), connections failing to be 
> established, and hosts being marked as unhealthy and shut down. We observed 
> ~800 passive TCP connections per second on the leading master during scale 
> tests.
> The problem can be somewhat mitigated by tuning the kernel to handle a 
> thundering herd of TCP connections, but ideally there would not be a 
> thundering herd to begin with.
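The intended ordering is roughly the sketch below (a minimal illustration with placeholder function names and a generic jittered exponential backoff, not the actual patch in the review above): the delay happens first, and only then the TCP connect and the first message over it.

{code}
#include <algorithm>
#include <chrono>
#include <random>
#include <thread>

// Placeholder stubs -- not Mesos/libprocess APIs.
void establishConnectionToMaster() {}
void sendReregisterSlaveMessage() {}

// Back off *before* establishing the TCP connection to the master, rather
// than connecting immediately and only delaying the first message.
void connectWithBackoff(int attempt)
{
  using namespace std::chrono;

  // Exponential backoff capped at one minute, with random jitter so agents
  // reacting to the same leader change do not reconnect in lockstep.
  const milliseconds cap = minutes(1);
  const milliseconds backoff = std::min(
      cap,
      milliseconds(seconds(1) * (1 << std::min(std::max(attempt, 0), 6))));

  static std::mt19937 gen{std::random_device{}()};
  std::uniform_int_distribution<milliseconds::rep> jitter(0, backoff.count());
  std::this_thread::sleep_for(milliseconds(jitter(gen)));

  establishConnectionToMaster();  // TCP connect only after the delay...
  sendReregisterSlaveMessage();   // ...then the first message over it.
}
{code}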



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-5330) Agent should backoff before connecting to the master

2016-05-09 Thread David Robinson (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5330?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

David Robinson updated MESOS-5330:
--
Comment: was deleted

(was: https://reviews.apache.org/r/47080/)

> Agent should backoff before connecting to the master
> 
>
> Key: MESOS-5330
> URL: https://issues.apache.org/jira/browse/MESOS-5330
> Project: Mesos
>  Issue Type: Bug
>Reporter: David Robinson
>Assignee: David Robinson
>
> When an agent is started, it starts a background task (libprocess process?) to 
> detect the leading master. When the leading master is detected (or changes), 
> the [SocketManager's link() method is called and a TCP connection to the 
> master is 
> established|https://github.com/apache/mesos/blob/a138e2246a30c4b5c9bc3f7069ad12204dcaffbc/src/slave/slave.cpp#L954].
> The agent _then_ backs off before sending a ReRegisterSlave message via the 
> newly established connection. The agent needs to back off _before_ attempting 
> to establish a TCP connection to the master, not before sending the first 
> message over the connection.
> During scale tests at Twitter we discovered that agents can SYN flood the 
> master upon leader changes; the problem described in MESOS-5200, where 
> ephemeral connections are used, can then occur, which exacerbates the problem. 
> The end result is a lot of hosts setting up and tearing down TCP connections 
> every slave_ping_timeout seconds (15 by default), connections failing to be 
> established, and hosts being marked as unhealthy and shut down. We observed 
> ~800 passive TCP connections per second on the leading master during scale 
> tests.
> The problem can be somewhat mitigated by tuning the kernel to handle a 
> thundering herd of TCP connections, but ideally there would not be a 
> thundering herd to begin with.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277516#comment-15277516
 ] 

Joseph Wu commented on MESOS-5342:
--

We only use GitHub PRs for website/UI-related changes.  Everything else needs 
to go through ReviewBoard.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.
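For reference, the pinning mechanic itself can be expressed on Linux with the CPU affinity syscall; the sketch below shows only that mechanic (the helper name and the choice of sched_setaffinity over cgroup cpusets are illustrative, not part of the proposed design).

{code}
#ifndef _GNU_SOURCE
#define _GNU_SOURCE  // For cpu_set_t, CPU_ZERO, CPU_SET, sched_setaffinity.
#endif

#include <sched.h>
#include <sys/types.h>  // pid_t

#include <vector>

// Pin the process `pid` to the given logical core ids; children it forks
// inherit the affinity mask. Returns false if the syscall fails.
bool pinToCores(pid_t pid, const std::vector<int>& cores)
{
  cpu_set_t mask;
  CPU_ZERO(&mask);

  for (int core : cores) {
    CPU_SET(core, &mask);
  }

  return sched_setaffinity(pid, sizeof(mask), &mask) == 0;
}
{code}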



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5354) Update "driver" as optional for DockerVolume.

2016-05-09 Thread Guangya Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5354?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Guangya Liu updated MESOS-5354:
---
Description: 
After some testing with the Docker API, I found that when using "docker run" to 
create a container, the volume name is required but the volume driver is 
optional. When using "dvdcli", both name and driver are required. We currently 
define "driver" as required; we should update "driver" to optional so that the 
DockerContainerizer still works even if the user did not specify a driver when 
creating a container with a volume.

{code}
message DockerVolume {
  // Driver of the volume, it can be flocker, convoy, raxrey etc.
  required string driver = 1;  < Shall we update this to optional?

  // Name of the volume.
  required string name = 2;

  // Volume driver specific options.
  optional Parameters driver_options = 3;
}
{code}

> Update "driver" as optional for DockerVolume.
> -
>
> Key: MESOS-5354
> URL: https://issues.apache.org/jira/browse/MESOS-5354
> Project: Mesos
>  Issue Type: Bug
>Reporter: Guangya Liu
>Assignee: Guangya Liu
>
> After some testing with the Docker API, I found that when using "docker run" 
> to create a container, the volume name is required but the volume driver is 
> optional. When using "dvdcli", both name and driver are required. We 
> currently define "driver" as required; we should update "driver" to optional 
> so that the DockerContainerizer still works even if the user did not specify 
> a driver when creating a container with a volume.
> {code}
> message DockerVolume {
>   // Driver of the volume, it can be flocker, convoy, raxrey etc.
>   required string driver = 1;  < Shall we update this to optional?
>   // Name of the volume.
>   required string name = 2;
>   // Volume driver specific options.
>   optional Parameters driver_options = 3;
> }
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5354) Update "driver" as optional for DockerVolume.

2016-05-09 Thread Guangya Liu (JIRA)
Guangya Liu created MESOS-5354:
--

 Summary: Update "driver" as optional for DockerVolume.
 Key: MESOS-5354
 URL: https://issues.apache.org/jira/browse/MESOS-5354
 Project: Mesos
  Issue Type: Bug
Reporter: Guangya Liu
Assignee: Guangya Liu


After some testing with the Docker API, I found that when using "docker run" to 
create a container, the volume name is required but the volume driver is 
optional. When using "dvdcli", both name and driver are required. We currently 
define "driver" as required; we should update "driver" to optional so that the 
DockerContainerizer still works even if the user did not specify a driver when 
creating a container with a volume.

{code}
message DockerVolume {
  // Driver of the volume, it can be flocker, convoy, raxrey etc.
  required string driver = 1;  < Shall we update this to optional?

  // Name of the volume.
  required string name = 2;

  // Volume driver specific options.
  optional Parameters driver_options = 3;
}
{code}
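If {{driver}} becomes optional, the consuming code can fall back to a default when the field is unset. A rough sketch against the generated protobuf accessors (the free function and the "local" fallback, Docker's built-in volume driver, are assumptions for illustration, not existing Mesos code):

{code}
#include <string>

// With `optional string driver = 1;` the generated C++ class gains
// has_driver(), so callers can supply a fallback when the framework omitted
// the field. `DockerVolume` stands for the class generated from the message
// above; "local" is only an assumed default.
template <typename DockerVolume>
std::string volumeDriver(const DockerVolume& volume)
{
  return volume.has_driver() ? volume.driver() : "local";
}
{code}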




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277463#comment-15277463
 ] 

Chris commented on MESOS-5342:
--

[~kaysoky] Sure thing - I've done some of this work prior to writing the code 
in a local README. It shouldn't be too much trouble transposing that 
information onto Google Docs. Oh, should the source be posted on GitHub under a 
separate branch for review?

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5353) Use `Connection` abstraction to compare stale connections in scheduler library.

2016-05-09 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-5353:
-

 Summary: Use `Connection` abstraction to compare stale connections 
in scheduler library.
 Key: MESOS-5353
 URL: https://issues.apache.org/jira/browse/MESOS-5353
 Project: Mesos
  Issue Type: Improvement
Reporter: Anand Mazumdar
Priority: Minor


Previously, we had a bug in the {{Connection}} abstraction in libprocess that 
hindered the ability to pass it to {{defer}} callbacks, since it could 
sometimes lead to deadlock (MESOS-4658). Now that it is resolved, we might 
consider not using {{UUID}} objects for stale connection checks but directly 
using the {{Connection}} abstraction in the scheduler library.
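A minimal sketch of the idea, assuming the connection type supports value equality (libprocess's {{http::Connection}} does); the class and names below are illustrative, not the scheduler library's actual code:

{code}
#include <memory>
#include <utility>

// Keep the connection an in-flight request was sent on and compare it with
// the current one when the response (or an error) arrives, instead of
// tagging connections with UUIDs.
template <typename Connection>
class ConnectionTracker
{
public:
  void connected(Connection connection)
  {
    current = std::make_shared<Connection>(std::move(connection));
  }

  // A response is stale if it was produced on a connection other than the
  // one we currently hold.
  bool isStale(const Connection& used) const
  {
    return current == nullptr || !(*current == used);
  }

private:
  std::shared_ptr<Connection> current;
};
{code}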



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-1575) master sets failover timeout to 0 when framework requests a high value

2016-05-09 Thread JIRA

[ 
https://issues.apache.org/jira/browse/MESOS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277406#comment-15277406
 ] 

José Guilherme Vanz commented on MESOS-1575:


[~neilc], is there something more to change in the patch? 

> master sets failover timeout to 0 when framework requests a high value
> --
>
> Key: MESOS-1575
> URL: https://issues.apache.org/jira/browse/MESOS-1575
> Project: Mesos
>  Issue Type: Bug
>Reporter: Kevin Sweeney
>Assignee: José Guilherme Vanz
>  Labels: newbie, twitter
>
> In response to a registered RPC we observed the following behavior:
> {noformat}
> W0709 19:07:32.982997 11400 master.cpp:612] Using the default value for 
> 'failover_timeout' becausethe input value is invalid: Argument out of the 
> range that a Duration can represent due to int64_t's size limit
> I0709 19:07:32.983008 11404 hierarchical_allocator_process.hpp:408] 
> Deactivated framework 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983013 11400 master.cpp:617] Giving framework 
> 20140709-184342-119646400-5050-11380-0003 0ns to failover
> I0709 19:07:32.983271 11404 master.cpp:2201] Framework failover timeout, 
> removing framework 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983294 11404 master.cpp:2688] Removing framework 
> 20140709-184342-119646400-5050-11380-0003
> I0709 19:07:32.983678 11404 hierarchical_allocator_process.hpp:363] Removed 
> framework 20140709-184342-119646400-5050-11380-0003
> {noformat}
> This was using the following frameworkInfo.
> {code}
> FrameworkInfo frameworkInfo = FrameworkInfo.newBuilder()
> .setUser("test")
> .setName("jvm")
> .setFailoverTimeout(Long.MAX_VALUE)
> .build();
> {code}
> Instead of silently defaulting large values to 0 the master should refuse to 
> process the request.
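A sketch of the suggested behavior, written as a standalone helper rather than the master's real validation path: reject out-of-range values instead of silently falling back to 0. The bound mirrors the int64_t-nanoseconds limit referenced in the log above.

{code}
#include <cstdint>
#include <limits>
#include <string>

struct Validation
{
  bool ok;
  std::string error;
};

// Validate 'failover_timeout' (in seconds) from FrameworkInfo before using it.
Validation validateFailoverTimeout(double seconds)
{
  // Maximum whole seconds representable as int64_t nanoseconds.
  const double max =
    static_cast<double>(std::numeric_limits<int64_t>::max()) / 1e9;

  if (seconds < 0.0 || seconds > max) {
    return {false,
            "'failover_timeout' is outside the range a Duration can represent"};
  }

  return {true, ""};
}
{code}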



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5352) Docker volume isolator cleanup can be blocked by first cleanup failure.

2016-05-09 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-5352:
---

 Summary: Docker volume isolator cleanup can be blocked by first 
cleanup failure.
 Key: MESOS-5352
 URL: https://issues.apache.org/jira/browse/MESOS-5352
 Project: Mesos
  Issue Type: Bug
  Components: containerization
Reporter: Gilbert Song


The summary title may be confusing; please see the description below for 
details.

Some background:
1). In the docker volume isolator cleanup, we currently do reference counting 
for docker volumes. The volume driver `unmount` will only be called if the ref 
count is 1.
2). We have built a hash map `infos` to track the docker volume mount 
information for each containerId. A containerId will be erased from the hash 
map only if all driver `unmount` calls succeed (each subprocess returns a ready 
future).

The issue in this JIRA is that if we have a slave running (not shut down or 
rebooted in this case) and keep launching frameworks which make use of docker 
volumes, then once any docker volume isolator cleanup returns a failure, all 
further `unmount` calls for those volumes will be blocked by the reference 
count: `_cleanup()` returned a failure, so the containerId is not erased from 
the hash map `infos`, even though all of its volumes may have been 
unmounted/detached correctly. (The docker volume isolator calls the driver 
unmount as a subprocess, and the driver may return a failure message even if 
all volumes are unmounted/detached correctly.) The stale containerId in `infos` 
then makes every subsequent isolator cleanup count one extra reference for 
those volumes, which means it declines to call the driver unmount. So after all 
tasks finish, the docker volumes from the first failure are still left in the 
`attached` status.

This issue goes away after the slave recovers, but we cannot rely on restarting 
the slave every time we hit this case.
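A toy model of the reference counting described above (illustrative only, not the isolator's code), showing how a single failed cleanup leaves a stale entry that blocks later unmounts:

{code}
#include <map>
#include <set>
#include <string>

using ContainerId = std::string;
using VolumeName = std::string;

// Stand-in for the dvdcli unmount subprocess call (which may report a failure
// even when the volume was actually detached).
bool driverUnmount(const VolumeName& volume) { return true; }

// `infos`: containerId -> docker volumes mounted for that container.
std::map<ContainerId, std::set<VolumeName>> infos;

// Count how many tracked containers still reference `volume`.
int references(const VolumeName& volume)
{
  int count = 0;
  for (const auto& entry : infos) {
    count += static_cast<int>(entry.second.count(volume));
  }
  return count;
}

// Simplified cleanup: `unmount` is only invoked when this container holds the
// volume's last reference. If anything fails and the containerId is *not*
// erased, the stale entry inflates every later reference count, so unmount is
// never attempted again for these volumes.
bool cleanup(const ContainerId& containerId)
{
  for (const VolumeName& volume : infos[containerId]) {
    if (references(volume) == 1) {
      if (!driverUnmount(volume)) {
        return false;  // Bug: `infos[containerId]` is left in place here.
      }
    }
  }

  infos.erase(containerId);
  return true;
}
{code}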



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277245#comment-15277245
 ] 

Joseph Wu commented on MESOS-5342:
--

You can post a link to the document as a JIRA link (we usually use Google Docs, 
but anything will work).

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5351) DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes is flaky

2016-05-09 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5351:


 Summary: 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes is 
flaky
 Key: MESOS-5351
 URL: https://issues.apache.org/jira/browse/MESOS-5351
 Project: Mesos
  Issue Type: Bug
  Components: test
 Environment: GCC 4.9
CentOS 7 and Fedora 23 (Both SSL or no-SSL)
Reporter: Joseph Wu


Consistently fails on Mesosphere internal CI:
{code}
[14:38:12] :   [Step 10/10] [ RUN  ] 
DockerVolumeIsolatorTest.ROOT_INTERNET_CURL_CommandTaskRootfsWithVolumes
[14:38:12]W:   [Step 10/10] I0509 14:38:12.782032  2386 cluster.cpp:149] 
Creating default 'local' authorizer
[14:38:12]W:   [Step 10/10] I0509 14:38:12.786592  2386 leveldb.cpp:174] Opened 
db in 4.462265ms
[14:38:12]W:   [Step 10/10] I0509 14:38:12.787979  2386 leveldb.cpp:181] 
Compacted db in 1.368995ms
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788007  2386 leveldb.cpp:196] 
Created db iterator in 4994ns
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788014  2386 leveldb.cpp:202] Seeked 
to beginning of db in 724ns
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788019  2386 leveldb.cpp:271] 
Iterated through 0 keys in the db in 388ns
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788031  2386 replica.cpp:779] 
Replica recovered with log positions 0 -> 0 with 1 holes and 0 unlearned
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788249  2402 recover.cpp:447] 
Starting replica recovery
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788316  2402 recover.cpp:473] 
Replica is in EMPTY status
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788684  2406 replica.cpp:673] 
Replica in EMPTY status received a broadcasted recover request from 
(18057)@172.30.2.145:48816
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788744  2405 recover.cpp:193] 
Received a recover response from a replica in EMPTY status
[14:38:12]W:   [Step 10/10] I0509 14:38:12.788869  2400 recover.cpp:564] 
Updating replica status to STARTING
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789206  2406 master.cpp:383] Master 
6c04237d-91d6-4a05-849a-8b46fdeafe76 (ip-172-30-2-145.mesosphere.io) started on 
172.30.2.145:48816
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789216  2406 master.cpp:385] Flags 
at startup: --acls="" --allocation_interval="1secs" 
--allocator="HierarchicalDRF" --authenticate="true" --authenticate_http="true" 
--authenticate_http_frameworks="true" --authenticate_slaves="true" 
--authenticators="crammd5" --authorizers="local" 
--credentials="/tmp/vepf2X/credentials" --framework_sorter="drf" --help="false" 
--hostname_lookup="true" --http_authenticators="basic" 
--http_framework_authenticators="basic" --initialize_driver_logging="true" 
--log_auto_initialize="true" --logbufsecs="0" --logging_level="INFO" 
--max_completed_frameworks="50" --max_completed_tasks_per_framework="1000" 
--max_slave_ping_timeouts="5" --quiet="false" 
--recovery_slave_removal_limit="100%" --registry="replicated_log" 
--registry_fetch_timeout="1mins" --registry_store_timeout="100secs" 
--registry_strict="true" --root_submissions="true" 
--slave_ping_timeout="15secs" --slave_reregister_timeout="10mins" 
--user_sorter="drf" --version="false" 
--webui_dir="/usr/local/share/mesos/webui" --work_dir="/tmp/vepf2X/master" 
--zk_session_timeout="10secs"
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789342  2406 master.cpp:434] Master 
only allowing authenticated frameworks to register
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789348  2406 master.cpp:440] Master 
only allowing authenticated agents to register
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789351  2406 master.cpp:446] Master 
only allowing authenticated HTTP frameworks to register
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789355  2406 credentials.hpp:37] 
Loading credentials for authentication from '/tmp/vepf2X/credentials'
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789466  2406 master.cpp:490] Using 
default 'crammd5' authenticator
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789504  2406 master.cpp:561] Using 
default 'basic' HTTP authenticator
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789540  2406 master.cpp:641] Using 
default 'basic' HTTP framework authenticator
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789599  2406 master.cpp:688] 
Authorization enabled
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789669  2402 hierarchical.cpp:142] 
Initialized hierarchical allocator process
[14:38:12]W:   [Step 10/10] I0509 14:38:12.789691  2407 
whitelist_watcher.cpp:77] No whitelist given
[14:38:12]W:   [Step 10/10] I0509 14:38:12.790190  2403 leveldb.cpp:304] 
Persisting metadata (8 bytes) to leveldb took 1.259226ms
[14:38:12]W:   [Step 10/10] I0509 14:38:12.790207  2403 replica.cpp:320] 
Persisted replica status to STARTING
[14:38:12]W:   [Step 10/10] I0509 14:38:12.790297  2406 master.cpp:1939] The 
newly elected leader is master@172.30.2.145:48816 with id 
6c04237d-91d6-4a05-849a-8b46fdeafe76
[14:38:12]W:   [Step 

[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277207#comment-15277207
 ] 

Chris commented on MESOS-5342:
--

[~kaysoky] where are design documents supposed to be posted? I've gone through 
the patch submission documentation and will review the testing documentation 
and style guides.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5350) Add asynchronous hook for validating docker containerizer tasks

2016-05-09 Thread Joseph Wu (JIRA)
Joseph Wu created MESOS-5350:


 Summary: Add asynchronous hook for validating docker containerizer 
tasks
 Key: MESOS-5350
 URL: https://issues.apache.org/jira/browse/MESOS-5350
 Project: Mesos
  Issue Type: Improvement
  Components: docker, modules
Reporter: Joseph Wu
Assignee: Joseph Wu
Priority: Minor


It is possible to plug in custom validation logic for the MesosContainerizer 
via an {{Isolator}} module, but the same is not true of the DockerContainerizer.

Basic logic can be plugged into the DockerContainerizer via {{Hooks}}, but this 
has some notable differences compared to isolators:
* Hooks are synchronous.
* Modifications to tasks via Hooks have lower priority compared to the task 
itself.  i.e. If both the {{TaskInfo}} and 
{{slaveExecutorEnvironmentDecorator}} define the same environment variable, the 
{{TaskInfo}} wins.
* Hooks have no effect if they fail (short of segfaulting).
i.e. The {{slavePreLaunchDockerHook}} has a return type of {{Try}}:
https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/include/mesos/hook.hpp#L90
But the only effect of returning an {{Error}} is a log message:
https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/hook/manager.cpp#L227-L230

We should add a hook to the DockerContainerizer to narrow this gap.  This new 
hook would:
* Be called at roughly the same place as {{slavePreLaunchDockerHook}}
https://github.com/apache/mesos/blob/628ccd23501078b04fb21eee85060a6226a80ef8/src/slave/containerizer/docker.cpp#L1022
* Return a {{Future}} and require splitting up {{DockerContainerizer::launch}}.
* Prevent a task from launching if it returns a {{Failure}}.
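A hypothetical shape for such a hook; the class, method name, and signature below are assumptions for illustration, not the existing {{mesos::Hook}} interface.

{code}
#include <mesos/mesos.hpp>

#include <process/future.hpp>

#include <stout/nothing.hpp>
#include <stout/option.hpp>

// Returning a Future lets a module validate asynchronously; a failed Future
// should abort the container launch instead of merely being logged.
class DockerTaskValidatorHook
{
public:
  virtual ~DockerTaskValidatorHook() {}

  // Called at roughly the same place as slavePreLaunchDockerHook today.
  virtual process::Future<Nothing> slavePreLaunchDockerValidation(
      const mesos::ContainerInfo& containerInfo,
      const mesos::CommandInfo& commandInfo,
      const Option<mesos::TaskInfo>& taskInfo) = 0;
};
{code}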



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Jie Yu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15277056#comment-15277056
 ] 

Jie Yu commented on MESOS-5342:
---

+1

Sending a patch to RB without a design (for a non-trivial feature) should be 
avoided.


> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Joseph Wu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276992#comment-15276992
 ] 

Joseph Wu commented on MESOS-5342:
--

Ideally (and especially for new contributors), you should find a shepherd 
_before_ starting work on an issue, which will save you time in the long run.

I would recommend taking some time and reading some of our contribution guides:
* http://mesos.apache.org/documentation/latest/c++-style-guide/
* http://mesos.apache.org/documentation/latest/submitting-a-patch/
* http://mesos.apache.org/documentation/latest/testing-patterns/

It would also help to have a design document that describes the goal and some 
implementation decisions you've made.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.

2016-05-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5349:
--
Description: 
We saw a weird issue happening on one of our test clusters over the weekend. A 
large number of tasks from the example {{long running framework}} were stuck in 
staging. The executor was duly sending status updates for all the tasks, and 
the slave successfully received the status updates as seen from the logs, but 
for some reason it never got around to checkpointing them.

From the agent logs, it seems that the agent kept retrying some backlogged 
status updates starting in the 4xxx/6xxx range (task IDs) while the present 
tasks were launched in the 8xxx range.

The issue resolved itself after a few hours, when the agent (re-)registered 
with the master upon losing its ZK session.

Let's take a timeline of a particular task 8142.

Agent logs before restart
{code}
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
slave.cpp:1522] Got assigned task 8142 for framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
slave.cpp:1641] Launching task 8142 for framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
disk.cpp:169] Updating the disk resources for container 
f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
disk.cpp:169] Updating the disk resources for container 
f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
{code}

Agent logs for this task upon restart:
{code}
May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 
state.cpp:606] Failed to find status updates file 
'/var/lib/mesos/slave/meta/slaves/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1/frameworks/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003/executors/default/runs/f68f137c-b101-4f9f-8de4-f50eae27e969/tasks/8142/task.updates'
{code}

Things that need to be investigated:
- Why couldn't the agent get around to handling the status updates from the 
executor, i.e. even checkpointing them?
- What made the agent get _so_ backlogged on the status updates, i.e. why did 
it keep resending the old status updates for the 4xxx/6xxx tasks without 
getting around to the newer tasks?

PFA the agent/master logs. This is running against Mesos {{HEAD -> 
557cab591f35a6c3d2248d7af7f06cdf99726e92}}

  was:
We 

[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.

2016-05-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5349:
--
Attachment: log_agent_before_zk_disconnect.gz

> A large number of tasks stuck in Staging state.
> ---
>
> Key: MESOS-5349
> URL: https://issues.apache.org/jira/browse/MESOS-5349
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.29.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere
> Attachments: agent-state.log, log_agent_after_zk_disconnect.gz, 
> log_agent_before_zk_disconnect.gz, master-state.log, mesos-master.WARNING, 
> mesos-slave.WARNING, staging_tasks.png
>
>
> We saw a weird issue happening on one of our test clusters over the weekend. 
> A large number of tasks from the example {{long running framework}} were 
> stuck in staging. The executor was duly sending status updates for all the 
> tasks and the slave successfully received the status update as seen from the 
> logs but for some reason never got to checkpointing them.
> From the agent logs, it seems that it kept on retrying some backlogged status 
> updates starting with the 4xxx/6xxx range while the present tasks were 
> launched in the 8xxx range. (task ID)
> The issue resolved itself after a few hours upon the agent (re-)registering 
> with the master upon losing its ZK session. 
> Let's take a timeline of a particular task 8142.
> Agent logs before restart
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
> slave.cpp:1522] Got assigned task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
> slave.cpp:1641] Launching task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
> slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
> cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
> slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
> slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
> 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
> slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
> 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
> cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
> Agent logs for this task upon restart:
> {code}
> May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 

[jira] [Issue Comment Deleted] (MESOS-5349) A large number of tasks stuck in Staging state.

2016-05-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5349:
--
Comment: was deleted

(was: Trying to upload the detailed master/agent logs somewhere else as they 
are rather large.)

> A large number of tasks stuck in Staging state.
> ---
>
> Key: MESOS-5349
> URL: https://issues.apache.org/jira/browse/MESOS-5349
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.29.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere
> Attachments: agent-state.log, log_agent_after_zk_disconnect.gz, 
> log_agent_before_zk_disconnect.gz, master-state.log, mesos-master.WARNING, 
> mesos-slave.WARNING, staging_tasks.png
>
>
> We saw a weird issue happening on one of our test clusters over the weekend. 
> A large number of tasks from the example {{long running framework}} were 
> stuck in staging. The executor was duly sending status updates for all the 
> tasks and the slave successfully received the status update as seen from the 
> logs but for some reason never got to checkpointing them.
> From the agent logs, it seems that it kept on retrying some backlogged status 
> updates starting with the 4xxx/6xxx range while the present tasks were 
> launched in the 8xxx range. (task ID)
> The issue resolved itself after a few hours upon the agent (re-)registering 
> with the master upon losing its ZK session. 
> Let's take a timeline of a particular task 8142.
> Agent logs before restart
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
> slave.cpp:1522] Got assigned task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
> slave.cpp:1641] Launching task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
> slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
> cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
> slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
> slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
> 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
> slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
> 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
> cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
> Agent logs for this task upon restart:
> 

[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.

2016-05-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5349:
--
Attachment: log_agent_after_zk_disconnect.gz

> A large number of tasks stuck in Staging state.
> ---
>
> Key: MESOS-5349
> URL: https://issues.apache.org/jira/browse/MESOS-5349
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.29.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere
> Attachments: agent-state.log, log_agent_after_zk_disconnect.gz, 
> master-state.log, mesos-master.WARNING, mesos-slave.WARNING, staging_tasks.png
>
>
> We saw a weird issue happening on one of our test clusters over the weekend. 
> A large number of tasks from the example {{long running framework}} were 
> stuck in staging. The executor was duly sending status updates for all the 
> tasks and the slave successfully received the status update as seen from the 
> logs but for some reason never got to checkpointing them.
> From the agent logs, it seems that it kept on retrying some backlogged status 
> updates starting with the 4xxx/6xxx range while the present tasks were 
> launched in the 8xxx range. (task ID)
> The issue resolved itself after a few hours upon the agent (re-)registering 
> with the master upon losing its ZK session. 
> Let's take a timeline of a particular task 8142.
> Agent logs before restart
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
> slave.cpp:1522] Got assigned task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
> slave.cpp:1641] Launching task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
> slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
> cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
> slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
> slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
> 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
> slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
> 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
> cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
> Agent logs for this task upon restart:
> {code}
> May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 
> state.cpp:606] Failed 

[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.

2016-05-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5349:
--
Attachment: staging_tasks.png

> A large number of tasks stuck in Staging state.
> ---
>
> Key: MESOS-5349
> URL: https://issues.apache.org/jira/browse/MESOS-5349
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.29.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere
> Attachments: agent-state.log, master-state.log, mesos-master.WARNING, 
> mesos-slave.WARNING, staging_tasks.png
>
>
> We saw a weird issue happening on one of our test clusters over the weekend. 
> A large number of tasks from the example {{long running framework}} were 
> stuck in staging. The executor was duly sending status updates for all the 
> tasks and the slave successfully received the status update as seen from the 
> logs but for some reason never got to checkpointing them.
> From the agent logs, it seems that it kept on retrying some backlogged status 
> updates starting with the 4xxx/6xxx range while the present tasks were 
> launched in the 8xxx range. (task ID)
> The issue resolved itself after a few hours upon the agent (re-)registering 
> with the master upon losing its ZK session. 
> Let's take a timeline of a particular task 8142.
> Agent logs before restart
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
> slave.cpp:1522] Got assigned task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
> slave.cpp:1641] Launching task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
> slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
> cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
> slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
> slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
> 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
> slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
> 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
> cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
> Agent logs for this task upon restart:
> {code}
> May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 
> state.cpp:606] Failed to find status updates file 
> 

[jira] [Updated] (MESOS-5349) A large number of tasks stuck in Staging state.

2016-05-09 Thread Anand Mazumdar (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Anand Mazumdar updated MESOS-5349:
--
Attachment: mesos-slave.WARNING
mesos-master.WARNING
master-state.log
agent-state.log

Trying to upload the detailed master/agent logs somewhere else as they are 
rather large.

> A large number of tasks stuck in Staging state.
> ---
>
> Key: MESOS-5349
> URL: https://issues.apache.org/jira/browse/MESOS-5349
> Project: Mesos
>  Issue Type: Bug
>  Components: slave
>Affects Versions: 0.29.0
>Reporter: Anand Mazumdar
>  Labels: mesosphere
> Attachments: agent-state.log, master-state.log, mesos-master.WARNING, 
> mesos-slave.WARNING
>
>
> We saw a weird issue happening on one of our test clusters over the weekend. 
> A large number of tasks from the example {{long running framework}} were 
> stuck in staging. The executor was duly sending status updates for all the 
> tasks, and the slave successfully received the status updates, as seen from the 
> logs, but for some reason never got around to checkpointing them.
> From the agent logs, it seems that it kept on retrying some backlogged status 
> updates starting in the 4xxx/6xxx task ID range while the current tasks were 
> launched in the 8xxx range.
> The issue resolved itself after a few hours when the agent re-registered 
> with the master after losing its ZK session.
> Let's walk through the timeline of one particular task, 8142.
> Agent logs before restart
> {code}
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
> slave.cpp:1522] Got assigned task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
> slave.cpp:1641] Launching task 8142 for framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
> slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
> cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
> slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
> slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
> 85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
> http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
> slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
> 285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
> ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
> disk.cpp:169] Updating the disk resources for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
> cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
> mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
> f68f137c-b101-4f9f-8de4-f50eae27e969
> May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
> cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
> to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
> {code}
> Agent logs for this task 

[jira] [Created] (MESOS-5349) A large number of tasks stuck in Staging state.

2016-05-09 Thread Anand Mazumdar (JIRA)
Anand Mazumdar created MESOS-5349:
-

 Summary: A large number of tasks stuck in Staging state.
 Key: MESOS-5349
 URL: https://issues.apache.org/jira/browse/MESOS-5349
 Project: Mesos
  Issue Type: Bug
  Components: slave
Affects Versions: 0.29.0
Reporter: Anand Mazumdar


We saw a weird issue happening on one of our test clusters over the weekend. A 
large number of tasks from the example {{long running framework}} were stuck in 
staging. The executor was duly sending status updates for all the tasks, and the 
slave successfully received the status updates, as seen from the logs, but for 
some reason never got around to checkpointing them.

From the agent logs, it seems that it kept on retrying some backlogged status 
updates starting in the 4xxx/6xxx task ID range while the current tasks were 
launched in the 8xxx range.

The issue resolved itself after a few hours when the agent re-registered 
with the master after losing its ZK session.

Let's walk through the timeline of one particular task, 8142.

Agent logs before restart
{code}
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.204941  2820 
slave.cpp:1522] Got assigned task 8142 for framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205142  2820 
slave.cpp:1641] Launching task 8142 for framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.205656  2820 
slave.cpp:1880] Queuing task '8142' for executor 'default' of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.206092  2818 
disk.cpp:169] Updating the disk resources for container 
f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.101; mem(*):33
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207093  2816 
mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 33MB for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.207293  2821 
cpushare.cpp:389] Updated 'cpu.shares' to 103 (cpus 0.101) for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208742  2821 
cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
to 10100us (cpus 0.101) for container f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.208902  2818 
slave.cpp:2032] Sending queued task '8142' to executor 'default' of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003 (via HTTP)
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210290  2821 
http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
May 08 00:47:34 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:34.210357  2821 
slave.cpp:3221] Handling status update TASK_RUNNING (UUID: 
85323c4f-e523-495e-9b49-39b0a7792303) for task 8142 of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213770  2817 
http.cpp:188] HTTP POST for /slave(1)/api/v1/executor from 10.10.0.6:60921
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.213882  2817 
slave.cpp:3221] Handling status update TASK_FINISHED (UUID: 
285b73e1-7f5a-43e5-8385-7b76e0fbdad4) for task 8142 of framework 
ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.214787  2821 
disk.cpp:169] Updating the disk resources for container 
f68f137c-b101-4f9f-8de4-f50eae27e969 to cpus(*):0.1; mem(*):32
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215365  2823 
cpushare.cpp:389] Updated 'cpu.shares' to 102 (cpus 0.1) for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.215641  2820 
mem.cpp:353] Updated 'memory.soft_limit_in_bytes' to 32MB for container 
f68f137c-b101-4f9f-8de4-f50eae27e969
May 08 00:47:40 ip-10-10-0-6 mesos-slave[2779]: I0508 00:47:40.216878  2823 
cpushare.cpp:411] Updated 'cpu.cfs_period_us' to 100ms and 'cpu.cfs_quota_us' 
to 10ms (cpus 0.1) for container f68f137c-b101-4f9f-8de4-f50eae27e969
{code}

Agent logs for this task upon restart:
{code}
May 09 15:22:14 ip-10-10-0-6 mesos-slave[14314]: W0509 15:22:14.083993 14318 
state.cpp:606] Failed to find status updates file 
'/var/lib/mesos/slave/meta/slaves/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-S1/frameworks/ad2ee74e-24f1-4381-be9a-1af70ba1ced0-0003/executors/default/runs/f68f137c-b101-4f9f-8de4-f50eae27e969/tasks/8142/task.updates'
{code}

Things that need to be investigated:
- Why couldn't the agent get around to handling the status updates from the 
executor i.e. even checkpointing them?
- What made the agent get _so_ backlogged on the status updates i.e. why it 
kept resending the old status updates for the 

[jira] [Commented] (MESOS-3220) Offer ability to kill tasks from the API

2016-05-09 Thread Michael Gummelt (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276822#comment-15276822
 ] 

Michael Gummelt commented on MESOS-3220:


+1.

I'm implementing this behavior in Spark.  It would be more efficient if Mesos 
offered it, so we wouldn't have to reimplement it at the framework level.

> Offer ability to kill tasks from the API
> 
>
> Key: MESOS-3220
> URL: https://issues.apache.org/jira/browse/MESOS-3220
> Project: Mesos
>  Issue Type: Improvement
>  Components: master
>Reporter: Sunil Shah
>  Labels: mesosphere
>
> We are investigating adding a {{dcos task kill}} command to our DCOS (and 
> Mesos) command line interface. Currently the ability to kill tasks is only 
> offered via the scheduler API so it would be useful to have some ability to 
> kill tasks directly.
> This would complement the Maintenance Primitives, in that it would enable the 
> operator to terminate those tasks which, for whatever reason, do not respond 
> to Inverse Offer events.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3243) Replace NULL with nullptr

2016-05-09 Thread Tomasz Janiszewski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3243?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276758#comment-15276758
 ] 

Tomasz Janiszewski commented on MESOS-3243:
---

Ping

> Replace NULL with nullptr
> -
>
> Key: MESOS-3243
> URL: https://issues.apache.org/jira/browse/MESOS-3243
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Michael Park
>Assignee: Tomasz Janiszewski
>
> As part of the C++ upgrade, it would be nice to move our use of {{NULL}} over 
> to use {{nullptr}}. I think it would be an interesting exercise to do this 
> with {{clang-modernize}} using the [nullptr 
> transform|http://clang.llvm.org/extra/UseNullptrTransform.html] (although 
> it's probably just as easy to use {{sed}}).
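> As a hedged illustration (editor's sketch, not part of the original ticket), either approach could look roughly like this; the clang-modernize invocation assumes the LLVM 3.x tooling plus a compile database in build/, and the file path is just an example:
> {noformat}
> # clang-modernize with the nullptr transform (needs compile_commands.json):
> clang-modernize -use-nullptr -p build/ src/master/master.cpp
> 
> # Or a blunt sed pass over the tree; results still need manual review:
> grep -rl --include='*.cpp' --include='*.hpp' '\bNULL\b' src/ \
>   | xargs sed -i 's/\bNULL\b/nullptr/g'
> {noformat}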



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5212) Allow any principal in ReservationInfo when HTTP authentication is off

2016-05-09 Thread Bernd Mathiske (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bernd Mathiske updated MESOS-5212:
--
Shepherd: Bernd Mathiske

> Allow any principal in ReservationInfo when HTTP authentication is off
> --
>
> Key: MESOS-5212
> URL: https://issues.apache.org/jira/browse/MESOS-5212
> Project: Mesos
>  Issue Type: Improvement
>Affects Versions: 0.28.1
>Reporter: Greg Mann
>Assignee: Greg Mann
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Mesos currently provides no way for operators to pass their principal to HTTP 
> endpoints when HTTP authentication is off. Since we enforce that 
> {{ReservationInfo.principal}} be equal to the operator principal in requests 
> to {{/reserve}}, this means that when HTTP authentication is disabled, the 
> {{ReservationInfo.principal}} field cannot be set.
> To address this in the short-term, we should allow 
> {{ReservationInfo.principal}} to hold any value when HTTP authentication is 
> disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.

2016-05-09 Thread Joseph Wu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph Wu updated MESOS-3926:
-
Sprint:   (was: Mesosphere Sprint 35)

> Modularize URI fetcher plugin interface.  
> --
>
> Key: MESOS-3926
> URL: https://issues.apache.org/jira/browse/MESOS-3926
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Jie Yu
>Assignee: Shuai Lin
>  Labels: fetcher, mesosphere, module
>
> So that we can add custom URI fetcher plugins using modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5239) Persistent volume DockerContainerizer support assumes proper mount propagation setup on the host.

2016-05-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5239?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5239:
--
  Sprint: Mesosphere Sprint 34
Story Points: 3

> Persistent volume DockerContainerizer support assumes proper mount 
> propagation setup on the host.
> -
>
> Key: MESOS-5239
> URL: https://issues.apache.org/jira/browse/MESOS-5239
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
>Affects Versions: 0.28.0, 0.28.1
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere
> Fix For: 0.29.0, 0.28.2
>
>
> We recently added persistent volume support in DockerContainerizer 
> (MESOS-3413). To understand the problem, we first need to understand how 
> persistent volumes are supported in DockerContainerizer.
> To support persistent volumes in DockerContainerizer, we bind mount 
> persistent volumes under a container's sandbox ('container_path' has to be 
> relative for persistent volumes). When the Docker container is launched, 
> since we always add a volume (-v) for the sandbox, the persistent volumes 
> will be bind mounted into the container as well (since Docker does a 'rbind').
> The assumption behind the above is that the Docker daemon can see the 
> persistent volume mounts that Mesos creates in the host mount table. 
> It's not a problem if the Docker daemon itself is using the host mount namespace. 
> However, on systemd enabled systems, Docker daemon is running in a separate 
> mount namespace and all mounts in that mount namespace will be marked as 
> slave mounts due to this 
> [patch|https://github.com/docker/docker/commit/eb76cb2301fc883941bc4ca2d9ebc3a486ab8e0a].
> So what that means is that: in order for it to work, the parent mount of 
> agent's work_dir should be a shared mount when docker daemon starts. This is 
> typically true on CentOS7, CoreOS as all mounts are shared mounts by default.
> However, this causes an issue with the 'filesystem/linux' isolator. To 
> understand why, first I need to show you a typical problem when dealing with 
> shared mounts. Let me explain that using the following commands on a CentOS7 
> machine:
> {noformat}
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> [root@core-dev run]# mkdir /run/netns
> [root@core-dev run]# mount --bind /run/netns /run/netns
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs 
> rw,seclabel,mode=755
> [root@core-dev run]# ip netns add test
> [root@core-dev run]# cat /proc/self/mountinfo
> 24 60 0:19 / /run rw,nosuid,nodev shared:22 - tmpfs tmpfs rw,seclabel,mode=755
> 121 24 0:19 /netns /run/netns rw,nosuid,nodev shared:22 - tmpfs tmpfs 
> rw,seclabel,mode=755
> 162 121 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc 
> proc rw
> 163 24 0:3 / /run/netns/test rw,nosuid,nodev,noexec,relatime shared:5 - proc 
> proc rw
> {noformat}
> As you can see above, there are two entries (/run/netns/test) in the mount 
> table, which is unexpected and can confuse some systems. The reason is 
> that when we create a self bind mount (/run/netns -> /run/netns), the 
> mount is put into the same shared mount peer group (shared:22) as its 
> parent (/run). Then, when you create another mount underneath that 
> (/run/netns/test), the mount operation is propagated to all mounts in 
> the same peer group (shared:22), resulting in an unexpected additional mount 
> being created.
> The reason we need to do a self bind mount in Mesos is that sometimes we 
> need to make sure certain mounts are shared so that they do not get copied as 
> private mounts when a new mount namespace is created. However, on some systems, 
> mounts are private by default (e.g., Ubuntu 14.04). In those cases, since we 
> cannot change the system mounts, we have to do a self bind mount so that we can 
> set mount propagation to shared. For instance, in the filesystem/linux isolator, 
> we do a self bind mount on the agent's work_dir.
> To avoid the self bind mount pitfall mentioned above, in the filesystem/linux 
> isolator, after we create the mount, we do a make-slave + make-shared so that 
> the mount forms its own shared mount peer group (see the sketch below). In that 
> way, any mounts underneath it will not be propagated back.
> However, that operation breaks the assumption that the persistent volume 
> DockerContainerizer support makes. As a result, we're seeing problems with 
> persistent volumes in DockerContainerizer when the filesystem/linux isolator is 
> turned on.
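> As a minimal sketch (editor's illustration, not code from the ticket; the work_dir path is a placeholder), the make-slave + make-shared sequence on the agent's work_dir looks like this:
> {noformat}
> mount --bind /var/lib/mesos /var/lib/mesos    # self bind mount of the work_dir
> mount --make-slave /var/lib/mesos             # detach from the parent's peer group
> mount --make-shared /var/lib/mesos            # become a new peer group of its own
> # Mounts created underneath the work_dir now stay out of the parent's peer
> # group, so no duplicate entries appear -- but they are also no longer
> # propagated to a Docker daemon whose mount namespace only slaves the original
> # group, which is exactly the conflict described above.
> {noformat}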



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5307) Sandbox mounts should not be in the host mount namespace.

2016-05-09 Thread Jie Yu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-5307:
--
Labels: mesosphere  (was: )

> Sandbox mounts should not be in the host mount namespace.
> -
>
> Key: MESOS-5307
> URL: https://issues.apache.org/jira/browse/MESOS-5307
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Jie Yu
>Assignee: Jie Yu
>  Labels: mesosphere
> Fix For: 0.29.0, 0.28.2
>
>
> Currently, if a container uses a container image, we'll do a bind mount of its 
> sandbox ( -> /mnt/mesos/sandbox) in the host mount namespace.
> However, doing the mounts in the host mount table is not ideal. That 
> complicates both the cleanup path and the recovery path.
> Instead, we can do the sandbox bind mount in the container's mount namespace 
> so that cleanup and recovery will be greatly simplified. We can set up mount 
> propagation properly so that persistent volumes mounted at /xxx can 
> be propagated into the container.
> Here is a simple proof of concept:
> Console 1:
> {noformat}
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ ll .
> total 12
> drwxrwxr-x 3 vagrant vagrant 4096 Apr 25 16:05 ./
> drwxrwxr-x 6 vagrant vagrant 4096 Apr 25 23:17 ../
> drwxrwxr-x 5 vagrant vagrant 4096 Apr 25 23:17 slave/
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ ll slave/
> total 20
> drwxrwxr-x  5 vagrant vagrant 4096 Apr 25 23:17 ./
> drwxrwxr-x  3 vagrant vagrant 4096 Apr 25 16:05 ../
> drwxrwxr-x  6 vagrant vagrant 4096 Apr 26 21:06 directory/
> drwxr-xr-x 12 vagrant vagrant 4096 Apr 25 23:20 rootfs/
> drwxrwxr-x  2 vagrant vagrant 4096 Apr 25 16:09 volume/
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ sudo mount --bind slave/ slave/ 
>   
>   
>
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ sudo mount --make-shared slave/
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ cat /proc/self/mountinfo 
> 50 22 8:1 /home/vagrant/tmp/mesos/slave /home/vagrant/tmp/mesos/slave 
> rw,relatime shared:1 - ext4 
> /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered
> {noformat}
> Console 2:
> {noformat}
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ cd slave/
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave$ sudo unshare -m /bin/bash
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# sudo mount --make-rslave .
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# cat /proc/self/mountinfo
> 124 63 8:1 /home/vagrant/tmp/mesos/slave /home/vagrant/tmp/mesos/slave 
> rw,relatime master:1 - ext4 
> /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# mount --rbind directory/ 
> rootfs/mnt/mesos/sandbox/ 
>   
>  
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# mount --rbind rootfs/ rootfs/
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# mount -t proc proc 
> rootfs/proc   
>   
>
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# pivot_root rootfs 
> rootfs/tmp/.rootfs
>   
> 
> root@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave# cd /
> root@vagrant-ubuntu-trusty-64:/# cat /proc/self/mountinfo
> 126 61 8:1 /home/vagrant/tmp/mesos/slave/rootfs / rw,relatime master:1 - ext4 
> /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered
> 127 126 8:1 /home/vagrant/tmp/mesos/slave/directory /mnt/mesos/sandbox 
> rw,relatime master:1 - ext4 
> /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered
> 128 126 0:3 / /proc rw,relatime - proc proc rw
> {noformat}
> Console 1:
> {noformat}
> agrant@vagrant-ubuntu-trusty-64:~/tmp/mesos$ cd slave/
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave$ sudo mount --bind volume/ 
> directory/v1
> vagrant@vagrant-ubuntu-trusty-64:~/tmp/mesos/slave$ cat /proc/self/mountinfo
> 50 22 8:1 /home/vagrant/tmp/mesos/slave /home/vagrant/tmp/mesos/slave 
> rw,relatime shared:1 - ext4 
> /dev/disk/by-uuid/baf292e5-0bb6-4e58-8a71-5b912e0f09b6 rw,data=ordered
> 129 50 8:1 /home/vagrant/tmp/mesos/slave/volume 
> /home/vagrant/tmp/mesos/slave/directory/v1 rw,relatime 

[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.

2016-05-09 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-3926:
-
Sprint: Mesosphere Sprint 35

> Modularize URI fetcher plugin interface.  
> --
>
> Key: MESOS-3926
> URL: https://issues.apache.org/jira/browse/MESOS-3926
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Jie Yu
>Assignee: Shuai Lin
>  Labels: fetcher, mesosphere, module
>
> So that we can add custom URI fetcher plugins using modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-3926) Modularize URI fetcher plugin interface.

2016-05-09 Thread Artem Harutyunyan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Artem Harutyunyan updated MESOS-3926:
-
Sprint:   (was: Mesosphere Sprint 34)

> Modularize URI fetcher plugin interface.  
> --
>
> Key: MESOS-3926
> URL: https://issues.apache.org/jira/browse/MESOS-3926
> Project: Mesos
>  Issue Type: Task
>  Components: fetcher
>Reporter: Jie Yu
>Assignee: Shuai Lin
>  Labels: fetcher, mesosphere, module
>
> So that we can add custom URI fetcher plugins using modules.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5277) Need to add REMOVE semantics to the copy backend

2016-05-09 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5277:

Sprint: Mesosphere Sprint 35

> Need to add REMOVE semantics to the copy backend
> 
>
> Key: MESOS-5277
> URL: https://issues.apache.org/jira/browse/MESOS-5277
> Project: Mesos
>  Issue Type: Bug
>  Components: containerization
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Gilbert Song
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Some Dockerfiles run the `rm` command to remove files from the base image 
> using the "RUN" directive in the Dockerfile. An example can be found here:
> https://github.com/ngineered/nginx-php-fpm.git
> In the final rootfs the removed files should not be present. Presence of 
> these files in the final image can make the container misbehave. For example, 
> the nginx-php-fpm docker image referenced above tries to remove the default 
> nginx config and replace it with its own config pointing to a different HTML 
> root. If the default nginx config is still present after building the 
> image, nginx will point to a different HTML root than the one set in 
> the Dockerfile.
> Currently the copy backend cannot handle removal of files from intermediate 
> layers. This can cause issues with docker images built using a Dockerfile 
> similar to the one listed here. Hence, we need to add REMOVE semantics to the 
> copy backend.  
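> A hedged sketch of what REMOVE semantics would mean for the copy backend, assuming the AUFS/OCI-style ".wh.<name>" whiteout convention marks deletions in a layer (editor's illustration; paths and the use of rsync are placeholders, not the actual backend code):
> {noformat}
> # For each whiteout marker in a layer, delete the shadowed path from the rootfs
> # built from the lower layers, then copy the layer without the markers.
> for wh in $(find layer/ -name '.wh.*'); do
>   rel="${wh#layer/}"
>   rm -rf "rootfs/$(dirname "$rel")/$(basename "$rel" | sed 's/^\.wh\.//')"
> done
> rsync -a --exclude='.wh.*' layer/ rootfs/
> {noformat}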



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4771) Document the network/cni isolator.

2016-05-09 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4771?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-4771:
-
Sprint: Mesosphere Sprint 35

> Document the network/cni isolator.
> --
>
> Key: MESOS-4771
> URL: https://issues.apache.org/jira/browse/MESOS-4771
> Project: Mesos
>  Issue Type: Task
>Reporter: Jie Yu
>Assignee: Avinash Sridharan
>
> We need to document this isolator in mesos-containerizer.md (e.g., how to 
> configure it, what the prerequisites are, etc.).



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4823) Implement port forwarding in `network/cni` isolator

2016-05-09 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4823?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-4823:
-
Sprint: Mesosphere Sprint 30, Mesosphere Sprint 31, Mesosphere Sprint 35  
(was: Mesosphere Sprint 30, Mesosphere Sprint 31)

> Implement port forwarding in `network/cni` isolator
> ---
>
> Key: MESOS-4823
> URL: https://issues.apache.org/jira/browse/MESOS-4823
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
> Environment: linux
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>Priority: Critical
>  Labels: mesosphere
>
> Most Docker and appc images wish to expose ports that micro-services are 
> listening on to the outside world. When containers are running on bridged 
> (or ptp) networking, this can be achieved by installing port forwarding rules 
> on the agent (using iptables). This can be done in the `network/cni` 
> isolator. 
> The reason we would like this functionality to be implemented in the 
> `network/cni` isolator, and not in a CNI plugin, is that the CNI specification 
> currently does not support specifying port forwarding rules. Further, to 
> install these rules the isolator needs two pieces of information: the exposed 
> ports and the IP address associated with the container. Both are available 
> to the isolator.
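> As a hedged sketch (editor's illustration; the port and address are placeholders, not the proposed implementation), the isolator would install rules along these lines for a task exposing container port 80 on host port 31000:
> {noformat}
> # DNAT traffic arriving on the agent's port 31000 to the container's IP.
> iptables -t nat -A PREROUTING -p tcp --dport 31000 \
>          -j DNAT --to-destination 172.16.0.2:80
> # Let the forwarded traffic through to the container.
> iptables -A FORWARD -p tcp -d 172.16.0.2 --dport 80 -j ACCEPT
> {noformat}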



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5066) Create an iptables interface in Mesos

2016-05-09 Thread Avinash Sridharan (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Avinash Sridharan updated MESOS-5066:
-
Sprint: Mesosphere Sprint 35

> Create an iptables interface in Mesos
> -
>
> Key: MESOS-5066
> URL: https://issues.apache.org/jira/browse/MESOS-5066
> Project: Mesos
>  Issue Type: Task
>  Components: containerization
>Reporter: Avinash Sridharan
>Assignee: Avinash Sridharan
>  Labels: mesosphere
>
> To support port mapping functionality in the network/cni isolator, we need to 
> install DNAT rules in iptables. We therefore need to create an iptables 
> interface in Mesos. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-4626) Support Nvidia GPUs with filesystem isolation enabled.

2016-05-09 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues reassigned MESOS-4626:
--

Assignee: Kevin Klues

> Support Nvidia GPUs with filesystem isolation enabled.
> --
>
> Key: MESOS-4626
> URL: https://issues.apache.org/jira/browse/MESOS-4626
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Benjamin Mahler
>Assignee: Kevin Klues
>
> When filesystem isolation is enabled, containers that use Nvidia GPU 
> resources need access to GPU libraries residing on the host.
> We'll need to provide a means for operators to inject the necessary volumes 
> into *all* containers that use "gpus" resources.
> See the nvidia-docker project for more details:
> [nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103]
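> As a hedged illustration (editor's sketch; library versions, paths and the use of bind mounts are placeholders that vary by driver version and distribution, and are not taken from the ticket), the injected volumes would cover roughly what nvidia-docker's volumes.go collects:
> {noformat}
> # Make the host's driver stack visible inside the container rootfs.
> mount --bind /usr/bin/nvidia-smi "$ROOTFS/usr/bin/nvidia-smi"
> mount --bind /usr/lib/x86_64-linux-gnu/libnvidia-ml.so.352.63 \
>              "$ROOTFS/usr/lib/x86_64-linux-gnu/libnvidia-ml.so.352.63"
> mount --bind /usr/lib/x86_64-linux-gnu/libcuda.so.352.63 \
>              "$ROOTFS/usr/lib/x86_64-linux-gnu/libcuda.so.352.63"
> # The device nodes /dev/nvidiactl, /dev/nvidia-uvm and /dev/nvidiaN also need
> # to be made available to the container.
> {noformat}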



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5257) Add autodiscovery for GPU resources

2016-05-09 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues reassigned MESOS-5257:
--

Assignee: Kevin Klues

> Add autodiscovery for GPU resources
> ---
>
> Key: MESOS-5257
> URL: https://issues.apache.org/jira/browse/MESOS-5257
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>  Labels: isolator
>
> Right now, the only way to enumerate the available GPUs on an agent is to use 
> the `--nvidia_gpu_devices` flag and explicitly list them out.  Instead, we 
> should leverage NVML to autodiscover the GPUs that are available and only use 
> this flag as a way to explicitly list out the GPUs you want to make available 
> in order to restrict access to some of them.
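> A hedged sketch of the enumeration this would perform, shown via the nvidia-smi front end rather than the NVML C API (editor's illustration; the sample output is made up):
> {noformat}
> $ nvidia-smi --query-gpu=index,name,uuid --format=csv,noheader
> 0, Tesla K80, GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
> 1, Tesla K80, GPU-yyyyyyyy-yyyy-yyyy-yyyy-yyyyyyyyyyyy
> {noformat}
> With that in place, `--nvidia_gpu_devices` would only be needed to restrict the agent to a subset of the discovered indices.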



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5258) Turn the Nvidia GPU isolator into a module

2016-05-09 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5258:
---
  Sprint: Mesosphere Sprint 35
Story Points: 5  (was: 3)

> Turn the Nvidia GPU isolator into a module
> --
>
> Key: MESOS-5258
> URL: https://issues.apache.org/jira/browse/MESOS-5258
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>
> The Nvidia GPU isolator has an external dependency on `libnvidia-ml.so`. As 
> it currently stands, this forces *all* binaries that link with `libmesos.so` 
> to also link with `libnvidia-ml.so` (including the master, agents on machines 
> without GPUs, schedulers, executors, etc.).
> By turning the Nvidia GPU isolator into a module, it will be loaded at 
> runtime only when an agent has explicitly included the Nvidia GPU 
> isolator in its `--isolation` flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-5258) Turn the Nvidia GPU isolator into a module

2016-05-09 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5258?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues reassigned MESOS-5258:
--

Assignee: Kevin Klues

> Turn the Nvidia GPU isolator into a module
> --
>
> Key: MESOS-5258
> URL: https://issues.apache.org/jira/browse/MESOS-5258
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>
> The Nvidia GPU isolator has an external dependency on `libnvidia-ml.so`. As 
> it currently stands, this forces *all* binaries that link with `libmesos.so` 
> to also link with `libnvidia-ml.so` (including the master, agents on machines 
> without GPUs, schedulers, executors, etc.).
> By turning the Nvidia GPU isolator into a module, it will be loaded at 
> runtime only when an agent has explicitly included the Nvidia GPU 
> isolator in its `--isolation` flag.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-4626) Support Nvidia GPUs with filesystem isolation enabled.

2016-05-09 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-4626?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-4626:
---
Sprint: Mesosphere Sprint 35

> Support Nvidia GPUs with filesystem isolation enabled.
> --
>
> Key: MESOS-4626
> URL: https://issues.apache.org/jira/browse/MESOS-4626
> Project: Mesos
>  Issue Type: Task
>  Components: isolation
>Reporter: Benjamin Mahler
>
> When filesystem isolation is enabled, containers that use Nvidia GPU 
> resources need access to GPU libraries residing on the host.
> We'll need to provide a means for operators to inject the necessary volumes 
> into *all* containers that use "gpus" resources.
> See the nvidia-docker project for more details:
> [nvidia-docker/tools/src/nvidia/volumes.go|https://github.com/NVIDIA/nvidia-docker/blob/fda10b2d27bf5578cc5337c23877f827e4d1ed77/tools/src/nvidia/volumes.go#L50-L103]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5257) Add autodiscovery for GPU resources

2016-05-09 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5257?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5257:
---
Sprint: Mesosphere Sprint 35

> Add autodiscovery for GPU resources
> ---
>
> Key: MESOS-5257
> URL: https://issues.apache.org/jira/browse/MESOS-5257
> Project: Mesos
>  Issue Type: Task
>Reporter: Kevin Klues
>  Labels: isolator
>
> Right now, the only way to enumerate the available GPUs on an agent is to use 
> the `--nvidia_gpu_devices` flag and explicitly list them out.  Instead, we 
> should leverage NVML to autodiscover the GPUs that are available and only use 
> this flag as a way to explicitly list out the GPUs you want to make available 
> in order to restrict access to some of them.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5221) Add Documentation for Nvidia GPU support

2016-05-09 Thread Kevin Klues (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5221?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Klues updated MESOS-5221:
---
Sprint: Mesosphere Sprint 33, Mesosphere Sprint 35  (was: Mesosphere Sprint 
33, Mesosphere Sprint 34)

> Add Documentation for Nvidia GPU support
> 
>
> Key: MESOS-5221
> URL: https://issues.apache.org/jira/browse/MESOS-5221
> Project: Mesos
>  Issue Type: Documentation
>Reporter: Kevin Klues
>Assignee: Kevin Klues
>Priority: Minor
>
> https://reviews.apache.org/r/46220/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5347) Enhance the log message when launching mesos containerizer.

2016-05-09 Thread Gilbert Song (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilbert Song updated MESOS-5347:

Fix Version/s: 0.29.0

> Enhance the log message when launching mesos containerizer.
> ---
>
> Key: MESOS-5347
> URL: https://issues.apache.org/jira/browse/MESOS-5347
> Project: Mesos
>  Issue Type: Improvement
>  Components: containerization
>Reporter: Gilbert Song
>Assignee: Guangya Liu
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> Log the launch flags, which include the executor command, pre-launch commands, 
> and other information, when launching the Mesos containerizer. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5348) Enhance the log message when launching docker containerizer.

2016-05-09 Thread Gilbert Song (JIRA)
Gilbert Song created MESOS-5348:
---

 Summary: Enhance the log message when launching docker 
containerizer.
 Key: MESOS-5348
 URL: https://issues.apache.org/jira/browse/MESOS-5348
 Project: Mesos
  Issue Type: Improvement
  Components: containerization
Reporter: Gilbert Song
Assignee: Guangya Liu
 Fix For: 0.29.0


Log the launch flags, which include the executor command and other information, 
when launching the Docker containerizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276658#comment-15276658
 ] 

Chris commented on MESOS-5342:
--

Forgot to mention, a shepherd is needed to support integration of this feature!

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.
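> As a minimal sketch of the underlying mechanism (editor's illustration using the cgroup v1 cpuset controller; the hierarchy path and values are placeholders, not the proposed design):
> {noformat}
> CG=/sys/fs/cgroup/cpuset/mesos/<container-id>
> mkdir -p "$CG"
> echo 2-3 > "$CG/cpuset.cpus"      # pin the container to cores 2 and 3
> echo 0   > "$CG/cpuset.mems"      # keep its memory on NUMA node 0
> echo "$PID" > "$CG/cgroup.procs"  # move the container's first process in
> {noformat}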



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5173) Allow master/agent to take multiple modules manifest files

2016-05-09 Thread Kapil Arya (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kapil Arya updated MESOS-5173:
--
Shepherd: Till Toenshoff
 Summary: Allow master/agent to take multiple modules manifest files  (was: 
Allow master/agent to take multiple --modules flags)

> Allow master/agent to take multiple modules manifest files
> --
>
> Key: MESOS-5173
> URL: https://issues.apache.org/jira/browse/MESOS-5173
> Project: Mesos
>  Issue Type: Task
>Reporter: Kapil Arya
>Assignee: Kapil Arya
>  Labels: mesosphere
> Fix For: 0.29.0
>
>
> When loading multiple modules into the master/agent, one has to merge all module 
> metadata (library name, module name, parameters, etc.) into a single JSON 
> file which is then passed to the --modules flag. This quickly becomes 
> cumbersome, especially if the modules are coming from different 
> vendors/developers.
> An alternative would be to allow multiple invocations of the --modules flag, 
> each of which is then passed on to the module manager. That way, each flag 
> corresponds to just one module library and the modules from that library.
> Another approach is to create a new flag (e.g., --modules-dir) that contains 
> a path to a directory holding multiple JSON files. One can think 
> of it as analogous to systemd units. The operator drops a new file 
> into this directory and the file is automatically picked up by the 
> master/agent module manager. Further, a naming scheme can be adopted that 
> prefixes the filename with "NN_" to signify load order.
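> A hedged sketch of the directory-based approach (editor's illustration; the flag name, path and file names are placeholders, not a settled design):
> {noformat}
> $ ls /etc/mesos/modules.d/
> 10_logging.json  20_vendor_hook.json  30_custom_isolator.json
> $ mesos-slave --modules_dir=/etc/mesos/modules.d ...
> {noformat}
> Each JSON file would carry one library plus its modules and parameters, so vendors can ship their manifests independently.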



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276521#comment-15276521
 ] 

Chris commented on MESOS-5342:
--

For information about submodular functions (and why it was selected for this 
problem), strongly suggest reviewing at least this youtube lecture/video 
(ideally the entire series of videos) publicly available from MLSS Iceland 
2014: https://youtu.be/6ThMzlHdKsI


> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris updated MESOS-5342:
-
Comment: was deleted

(was: For information about submodular functions (and why it was selected for 
this problem), strongly suggest reviewing at least this youtube lecture/video 
(ideally the entire series of videos) publicly available from MLSS Iceland 
2014: https://youtu.be/6ThMzlHdKsI)

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276518#comment-15276518
 ] 

Chris commented on MESOS-5342:
--

For information about submodular functions (and why it was selected for this 
problem), strongly suggest reviewing at least this youtube lecture/video 
(ideally the entire series of videos) publicly available from MLSS Iceland 
2014: https://youtu.be/6ThMzlHdKsI

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris updated MESOS-5342:
-
Comment: was deleted

(was: For information about submodular functions (and why it was selected for 
this problem), strongly suggest reviewing this youtube video: 
https://youtu.be/6ThMzlHdKsI)

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276514#comment-15276514
 ] 

Chris commented on MESOS-5342:
--

For information about submodular functions (and why it was selected for this 
problem), strongly suggest reviewing this youtube video: 
https://youtu.be/6ThMzlHdKsI

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris updated MESOS-5342:
-
Comment: was deleted

(was: Fixed a small bug in the greedy submodular subset selection algorithm. 
The "submodular cost" of selecting a core was being used in the knapsack budget 
test (cores currently have an at-most-budget-cost of "1.0"). The correct cost 
is now being used in the test.)

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276500#comment-15276500
 ] 

Chris commented on MESOS-5342:
--

Fixed a small bug in the greedy submodular subset selection algorithm. The 
"submodular cost" of selecting a core was being used in the knapsack budget 
test (cores currently have an at-most-budget-cost of "1.0"). The correct cost 
is now being used in the test.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5168) Benchmark overhead of authorization based filtering.

2016-05-09 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276477#comment-15276477
 ] 

Joerg Schad commented on MESOS-5168:


and the current prototype used to benchmark here: 
https://github.com/joerg84/mesos/tree/filterPrototype

> Benchmark overhead of authorization based filtering.
> 
>
> Key: MESOS-5168
> URL: https://issues.apache.org/jira/browse/MESOS-5168
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: authorization, mesosphere, security
> Fix For: 0.29.0
>
>
> When adding authorization based filtering as outlined in MESOS-4931 we need 
> to be careful especially for performance critical endpoints such as /state.
> We should ensure via a benchmark that performance does not degrade below an 
> acceptable state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5168) Benchmark overhead of authorization based filtering.

2016-05-09 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5168?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276475#comment-15276475
 ] 

Joerg Schad commented on MESOS-5168:


Benchmark results can be found here:
https://docs.google.com/document/d/1Ojq55I_2iyYWMSxnq9TvshVn9JGwEX4PfRBkwgZTiuY

> Benchmark overhead of authorization based filtering.
> 
>
> Key: MESOS-5168
> URL: https://issues.apache.org/jira/browse/MESOS-5168
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Joerg Schad
>Assignee: Joerg Schad
>  Labels: authorization, mesosphere, security
> Fix For: 0.29.0
>
>
> When adding authorization based filtering as outlined in MESOS-4931 we need 
> to be careful especially for performance critical endpoints such as /state.
> We should ensure via a benchmark that performance does not degrade below an 
> acceptable state.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5346) Some endpoints do not specify their allowed request methods.

2016-05-09 Thread Jan Schlicht (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5346?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jan Schlicht updated MESOS-5346:

Priority: Minor  (was: Major)

> Some endpoints do not specify their allowed request methods.
> 
>
> Key: MESOS-5346
> URL: https://issues.apache.org/jira/browse/MESOS-5346
> Project: Mesos
>  Issue Type: Bug
>  Components: security, technical debt
>Reporter: Jan Schlicht
>Priority: Minor
>  Labels: http, security, tech-debt
>
> Some HTTP endpoints (for example "/flags" or "/state") create a response 
> regardless of what the request method is. For example an HTTP POST to the 
> "/state" endpoint will create the same response as an HTTP GET.
> While this inconsistency isn't harmful at the moment, it will become 
> problematic when authorization is implemented using separate ACLs for 
> endpoints that can be GETed and endpoints that can be POSTed to.
> Validation of the request method should be added to all endpoints, e.g. 
> "/state" should return a 405 (Method Not Allowed) when POSTed to.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5340) libevent builds may prevent new connections

2016-05-09 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5340:
--
Description: 
When using an SSL-enabled build of Mesos in combination with SSL-downgrading 
support, any connection that does not actually transmit data will hang the 
runnable (e.g. master).

For reproducing the issue (on any platform)...

Spin up a master with enabled SSL-downgrading:
{noformat}
$ export SSL_ENABLED=true
$ export SSL_SUPPORT_DOWNGRADE=true
$ export SSL_KEY_FILE=/path/to/your/foo.key
$ export SSL_CERT_FILE=/path/to/your/foo.crt
$ export SSL_CA_FILE=/path/to/your/ca.crt
$ ./bin/mesos-master.sh --work_dir=/tmp/foo
{noformat}

Create some artificial HTTP request load so the problem can be quickly spotted 
in both the master logs and the output of curl itself:
{noformat}
$ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL 
Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; echo); 
done
{noformat}

Now create a connection to the master that does not transmit any data:
{noformat}
$ telnet localhost 5050
{noformat}

You should now see the curl requests hanging; the master stops responding to 
new connections. This persists until either some data is transmitted via the 
above telnet connection or the connection is closed.

This problem was initially observed when running Mesos on an AWS cluster with 
a load balancer enabled for the master node (the load balancer holds an idle, 
persistent connection). Such a connection naturally does not transmit any data 
as long as no external requests are routed via the load balancer. AWS allows 
configuring a timeout for those connections; in our test environment this 
duration was set to 60 seconds, so we saw the master becoming unresponsive for 
60 seconds at a time, then getting "unstuck" for a brief period until it got 
stuck again.


  was:
When using an SSL-enabled build of Mesos in combination with SSL-downgrading 
support, any connection that does not actually transmit data will hang the 
runnable (e.g. master).

For reproducing the issue (on any platform)...

Spin up a master with enabled SSL-downgrading:
{noformat}
$ export SSL_ENABLED=true
$ export SSL_SUPPORT_DOWNGRADE=true
$ export SSL_KEY_FILE=/path/to/your/foo.key
$ export SSL_CERT_FILE=/path/to/your/foo.crt
$ export SSL_CA_FILE=/path/to/your/ca.crt
$ ./bin/mesos-master.sh --work_dir=/tmp/foo
{noformat}

Create some artificial HTTP request load for quickly spotting the problem in 
both, the master logs as well as the output of CURL itself:
{noformat}
$ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL 
Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; echo); 
done
{noformat}

Now create a connection to the master that does not transmit any data:
{noformat}
$ telnet localhost 5050
{noformat}

You should now see the CURL requests hanging, the master stops responding to 
new connections. This will persist until either some data is transmitted via 
the above telnet connection or it is closed.

This problem has initially been observed when running Mesos on an AWS cluster 
with enabled internal ELB health-checks for the master node. Those 
health-checks are using long-lasting connections that do not transmit any data 
and are closed after a configurable duration. In our test environment, this 
duration was set to 60 seconds and hence we were seeing our master getting 
repetitively unresponsive for 60 seconds, then getting "unstuck" for a brief 
period until it got stuck again.



> libevent builds may prevent new connections
> ---
>
> Key: MESOS-5340
> URL: https://issues.apache.org/jira/browse/MESOS-5340
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 0.29.0, 0.28.1
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: mesosphere, security, ssl
>
> When using an SSL-enabled build of Mesos in combination with SSL-downgrading 
> support, any connection that does not actually transmit data will hang the 
> runnable (e.g. master).
> For reproducing the issue (on any platform)...
> Spin up a master with enabled SSL-downgrading:
> {noformat}
> $ export SSL_ENABLED=true
> $ export SSL_SUPPORT_DOWNGRADE=true
> $ export SSL_KEY_FILE=/path/to/your/foo.key
> $ export SSL_CERT_FILE=/path/to/your/foo.crt
> $ export SSL_CA_FILE=/path/to/your/ca.crt
> $ ./bin/mesos-master.sh --work_dir=/tmp/foo
> {noformat}
> Create some artificial HTTP request load for quickly spotting the problem in 
> both, the master logs as well as the output of CURL itself:
> {noformat}
> $ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL 
> Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; 
> echo); done
> {noformat}
> 

[jira] [Commented] (MESOS-5345) Design doc for TASK_GONE

2016-05-09 Thread Neil Conway (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5345?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276305#comment-15276305
 ] 

Neil Conway commented on MESOS-5345:


Initial design doc: 
https://docs.google.com/document/d/1AWpb-tXb53FEaPSzRAdaAS3aTBQ9b_wAZx1L0pQ_s-s

> Design doc for TASK_GONE
> 
>
> Key: MESOS-5345
> URL: https://issues.apache.org/jira/browse/MESOS-5345
> Project: Mesos
>  Issue Type: Task
>  Components: master
>Reporter: Neil Conway
>  Labels: mesosphere
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5345) Design doc for TASK_GONE

2016-05-09 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5345:
--

 Summary: Design doc for TASK_GONE
 Key: MESOS-5345
 URL: https://issues.apache.org/jira/browse/MESOS-5345
 Project: Mesos
  Issue Type: Task
  Components: master
Reporter: Neil Conway






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5344) Introduce TASK_GONE task status

2016-05-09 Thread Neil Conway (JIRA)
Neil Conway created MESOS-5344:
--

 Summary: Introduce TASK_GONE task status
 Key: MESOS-5344
 URL: https://issues.apache.org/jira/browse/MESOS-5344
 Project: Mesos
  Issue Type: Epic
  Components: master
Reporter: Neil Conway


The TASK_LOST task status describes two different situations: (a) the task was 
not launched because of an error (e.g., insufficient available resources), or 
(b) the master lost contact with a running task (e.g., due to a network 
partition); the master will kill the task when it can (e.g., when the network 
partition heals), but in the meantime the task may still be running.

This has two problems:
1. Using the same task status for two fairly different situations is confusing.
2. In the partitioned-but-still-running case, frameworks have no easy way to 
determine when a task has truly terminated.

To address these problems, we propose introducing a new task status, TASK_GONE, 
which would be used whenever a task can be guaranteed to not be running.
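
As a rough sketch of the intended semantics (illustrative only: the enum below 
merely stands in for mesos::TaskState, and GONE is the proposed status, which 
does not exist yet), a framework could treat TASK_GONE as definitively 
terminal while continuing to treat TASK_LOST as possibly-still-running:

{code}
// Illustrative only; not the Mesos API.
#include <iostream>

enum class TaskState { RUNNING, FINISHED, FAILED, LOST, GONE /* proposed */ };

// Under the proposal, GONE guarantees the task is no longer running, so a
// framework can reclaim its resources immediately. LOST may still be running
// behind a network partition, so the framework should keep waiting.
bool definitelyTerminated(TaskState state)
{
  switch (state) {
    case TaskState::FINISHED:
    case TaskState::FAILED:
    case TaskState::GONE:
      return true;
    default:                 // RUNNING, or LOST (possibly still running).
      return false;
  }
}

int main()
{
  std::cout << std::boolalpha
            << definitelyTerminated(TaskState::GONE) << " "        // true
            << definitelyTerminated(TaskState::LOST) << std::endl; // false
  return 0;
}
{code}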



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-5343) Behavior of custom HTTP authenticators with disabled HTTP authentication is inconsistent between master and agent

2016-05-09 Thread Benjamin Bannier (JIRA)
Benjamin Bannier created MESOS-5343:
---

 Summary: Behavior of custom HTTP authenticators with disabled HTTP 
authentication is inconsistent between master and agent
 Key: MESOS-5343
 URL: https://issues.apache.org/jira/browse/MESOS-5343
 Project: Mesos
  Issue Type: Bug
Reporter: Benjamin Bannier


When setting a custom authenticator with {{http_authenticators}} and also 
specifying {{authenticate_http=false}}, agents currently refuse to start with
{code}
A custom HTTP authenticator was specified with the '--http_authenticators' 
flag, but HTTP authentication was not enabled via '--authenticate_http'
{code}

Masters, on the other hand, accept this combination.

Having differing behavior between master and agent is confusing; we should 
decide whether we want to accept this flag combination and make the 
implementations consistent.
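
One way to make the behavior consistent would be to run the same validation in 
both binaries, for example rejecting the combination in both. The snippet 
below is a self-contained sketch of that idea, not Mesos code; the Flags 
struct and validate() function are assumptions for illustration.

{code}
// Sketch of a shared flag-validation rule; names are illustrative.
#include <iostream>
#include <string>

struct Flags
{
  bool authenticate_http = false;
  std::string http_authenticators; // Empty means "use the default".
};

// Returns a non-empty error message if the flag combination is rejected.
std::string validate(const Flags& flags)
{
  if (!flags.http_authenticators.empty() && !flags.authenticate_http) {
    return "A custom HTTP authenticator was specified with the "
           "'--http_authenticators' flag, but HTTP authentication was not "
           "enabled via '--authenticate_http'";
  }
  return "";
}

int main()
{
  Flags flags;
  flags.http_authenticators = "my_custom_authenticator"; // hypothetical module

  const std::string error = validate(flags); // Same check on master and agent.
  if (!error.empty()) {
    std::cerr << error << std::endl;
    return 1;
  }
  return 0;
}
{code}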




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris updated MESOS-5342:
-
Description: 
The cgroups isolator currently lacks support for binding (also called pinning) 
containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal 
core assignments for processes and threads. Poor assignments impact program 
performance, specifically in terms of cache locality. Applications requiring 
GPU resources can benefit from this feature by getting access to cores closest 
to the GPU hardware, which reduces cpu-gpu copy latency.

Most cluster management systems from the HPC community (SLURM) provide both 
cgroup isolation and cpu binding. This feature would provide similar 
capabilities. The current interest in supporting Intel's Cache Allocation 
Technology, and the advent of Intel's Knights-series processors, will require 
making choices about where containers are going to run on the mesos-agent's 
processor(s) cores - this feature is a step toward developing a robust solution.

The improvement in this JIRA ticket will handle hardware topology detection, 
track container-to-core utilization in a histogram, and use a mathematical 
optimization technique to select cores for container assignment based on 
latency and the container-to-core utilization histogram.

For GPU tasks, the improvement will prioritize selection of cores based on 
latency between the GPU and cores in an effort to minimize copy latency.

  was:
The cgroups isolator currently lacks support for binding (also called pinning) 
containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal 
core assignments for processes and threads. Poor assignments impact program 
performance,specifically in terms of cache locality. Applications requiring GPU 
resources can benefit from this feature by getting access to cores closest to 
the GPU hardware, which reduces cpu-gpu copy latency.

Most cluster management systems from the HPC community (SLURM) provide both 
cgroup isolation and cpu binding. This feature would provide similar 
capabilities. The current interest in supporting Intel's Cache Allocation 
Technology will require making choices about where container's are going to run 
on the mesos-agent's processor(s) - this feature is a step toward developing a 
robust solution.

The improvement in this JIRA ticket will handle hardware topology detection, 
track container-to-core utilization in a histogram, and use a mathematical 
optimization technique to select cores for container assignment based on 
latency and the container-to-core utilization histogram.

For GPU tasks, the improvement will prioritize selection of cores based on 
latency between the GPU and cores in an effort to minimize copy latency.


> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology, and the advent of Intel's Knights-series processors, will require 
> making choices about where containers are going to run on the mesos-agent's 
> processor(s) cores - this feature is a step toward developing a robust 
> solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris updated MESOS-5342:
-
Description: 
The cgroups isolator currently lacks support for binding (also called pinning) 
containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal 
core assignments for processes and threads. Poor assignments impact program 
performance, specifically in terms of cache locality. Applications requiring GPU 
resources can benefit from this feature by getting access to cores closest to 
the GPU hardware, which reduces cpu-gpu copy latency.

Most cluster management systems from the HPC community (SLURM) provide both 
cgroup isolation and cpu binding. This feature would provide similar 
capabilities. The current interest in supporting Intel's Cache Allocation 
Technology will require making choices about where containers are going to run 
on the mesos-agent's processor(s) - this feature is a step toward developing a 
robust solution.

The improvement in this JIRA ticket will handle hardware topology detection, 
track container-to-core utilization in a histogram, and use a mathematical 
optimization technique to select cores for container assignment based on 
latency and the container-to-core utilization histogram.

For GPU tasks, the improvement will prioritize selection of cores based on 
latency between the GPU and cores in an effort to minimize copy latency.

  was:
The cgroups isolator currently lacks support for binding (also called pinning) 
containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal 
core assignments for processes and threads. Poor assignments impact program 
performance, particularly in the case of applications requiring GPU resources. 

Most cluster management systems from the HPC community (SLURM) provide both 
cgroup isolation and cpu binding. This feature would provide similar 
capabilities. The current interest in supporting Intel's Cache Allocation 
Technology will require making choices about where container's are going to run 
on the mesos-agent's processor(s) - this feature is a step toward developing a 
robust solution.

The improvement in this JIRA ticket will handle hardware topology detection, 
track container-to-core utilization in a histogram, and use a mathematical 
optimization technique to select cores for container assignment based on 
latency and the container-to-core utilization histogram.

For GPU tasks, the improvement will prioritize selection of cores based on 
latency between the GPU and cores in an effort to minimize copy latency.


> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, specifically in terms of cache locality. 
> Applications requiring GPU resources can benefit from this feature by getting 
> access to cores closest to the GPU hardware, which reduces cpu-gpu copy 
> latency.
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology will require making choices about where containers are going to 
> run on the mesos-agent's processor(s) - this feature is a step toward 
> developing a robust solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276195#comment-15276195
 ] 

Chris commented on MESOS-5342:
--

The implementation has been posted to Review Board and successfully passed 
'make check'. It requires the hwloc library to perform machine hardware 
topology discovery and cpu binding; updates were made to configure.ac and 
Makefile.am.

The implementation is a new "device" under the cgroups isolator directory 
called "hwloc". It detects the hardware topology, computes the total number of 
cores required by the container, and checks whether the container requires a 
GPU. If the container requires a GPU, the topology information is used to find 
the "closest" cores based on latency. If the container only requires cpu, a 
histogram of task-to-core assignments is consulted.

If the histogram is "empty" (all cores have a value of 1.0), a random core is 
selected and the latency matrix is used to find the cores "closest" to that 
random core; the histogram is then updated. If the histogram is "not empty", a 
greedy submodular subset selection algorithm selects N cores using the latency 
matrix and a "per-core" cost value. The "per-core" cost is a normalized version 
of the histogram divided by the number of processing units available on each 
core. Greedy submodular subset selection algorithms exploit a "diminishing 
returns" property to find an optimal subset of items under a knapsack 
constraint.

When the list of cores is returned, a bit vector representing a cpuset is bound 
to the container's pid_t. When the container is cleaned up, the histogram is 
updated by reducing the current task count on each core assigned to that pid_t 
by 1.0.
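
To make the selection step concrete, here is a minimal, self-contained C++ 
sketch of a greedy pick under a knapsack budget using a latency matrix and 
per-core utilization costs. It is an illustration of the approach described 
above, not the code under review; selectCores(), the scoring rule, and the 
example numbers are all made up.

{code}
// Greedy core selection sketch: pick `n` cores by repeatedly choosing the
// unchosen core with the lowest (utilization cost + latency to the cores
// picked so far), charging a fixed at-most cost of 1.0 per core against the
// knapsack budget.
#include <cstddef>
#include <iostream>
#include <limits>
#include <vector>

std::vector<size_t> selectCores(
    const std::vector<std::vector<double>>& latency, // core-to-core latency
    const std::vector<double>& cost,                 // normalized histogram
    size_t n)
{
  const size_t numCores = cost.size();
  const double budget = static_cast<double>(n); // Each core costs 1.0.

  std::vector<bool> chosen(numCores, false);
  std::vector<size_t> result;
  double spent = 0.0;

  while (result.size() < n && spent + 1.0 <= budget) {
    double best = std::numeric_limits<double>::max();
    size_t bestCore = numCores;

    for (size_t c = 0; c < numCores; ++c) {
      if (chosen[c]) continue;

      // Marginal score: utilization cost plus latency to the cores selected
      // so far (zero for the first pick).
      double score = cost[c];
      for (size_t s : result) {
        score += latency[c][s];
      }

      if (score < best) {
        best = score;
        bestCore = c;
      }
    }

    if (bestCore == numCores) break; // Nothing left to choose.

    chosen[bestCore] = true;
    result.push_back(bestCore);
    spent += 1.0; // Charge the at-most budget cost, not the marginal score.
  }

  return result;
}

int main()
{
  // Two sockets of two cores each; cross-socket latency is higher.
  std::vector<std::vector<double>> latency = {
      {0.0, 1.0, 4.0, 4.0},
      {1.0, 0.0, 4.0, 4.0},
      {4.0, 4.0, 0.0, 1.0},
      {4.0, 4.0, 1.0, 0.0}};

  // Core 3 is already busier than the rest.
  std::vector<double> cost = {0.1, 0.1, 0.1, 0.9};

  for (size_t core : selectCores(latency, cost, 2)) {
    std::cout << "selected core " << core << std::endl; // 0, then 1
  }
  return 0;
}
{code}

A real implementation would derive the latency matrix and per-core costs from 
the hwloc topology data and the utilization histogram described above.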

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, particularly in the case of applications 
> requiring GPU resources. 
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology will require making choices about where containers are going to 
> run on the mesos-agent's processor(s) - this feature is a step toward 
> developing a robust solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Issue Comment Deleted] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris updated MESOS-5342:
-
Comment: was deleted

(was: implementation has been posted to review board. successfully passed 'make 
check'. requires use of the hwloc library to perform machine hardware topology 
discovery and cpu binding. updates were made to configure.ac and Makefile.am.

implementation is a new "device" under the cgroups isolator directory called 
"hwloc". Implementation detects topology, computes total number of cores 
required by the container (also checks if the container requires gpu). if the 
container requires gpu, the topology information is used to find the "closest" 
cores based on latency. If the container only requires cpu, a histogram of task 
assignment to cores is checked. If the histogram is "empty" (all cores have a 
value of 1.0) then a random core is selected and the latency matrix is used to 
find cores that are "closest" to the random core. The histogram is updated. If 
the histogram is "not empty" then a greedy submodular subset selection 
algorithm is used to select N cores using the latency matrix and a "per-core" 
cost value. The "per-core" cost value is a normalized version of the histogram 
divided by the number of processing units available on each core.  Greedy 
submodular subset selection algorithms use a "diminishing returns property" to 
find an optimal subset of items under a knapsack constraint.

When the list of cores is returned, a bit vector representing a cpuset is bound 
to the container's pid_t. When the container is cleaned up, the histogram is 
updated by reducing the current task counts on each core assigned to the pid_t 
by -1.0.)

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, particularly in the case of applications 
> requiring GPU resources. 
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology will require making choices about where containers are going to 
> run on the mesos-agent's processor(s) - this feature is a step toward 
> developing a robust solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276194#comment-15276194
 ] 

Chris commented on MESOS-5342:
--

The implementation has been posted to Review Board and successfully passed 
'make check'. It requires the hwloc library to perform machine hardware 
topology discovery and cpu binding; updates were made to configure.ac and 
Makefile.am.

The implementation is a new "device" under the cgroups isolator directory 
called "hwloc". It detects the hardware topology, computes the total number of 
cores required by the container, and checks whether the container requires a 
GPU. If the container requires a GPU, the topology information is used to find 
the "closest" cores based on latency. If the container only requires cpu, a 
histogram of task-to-core assignments is consulted. If the histogram is 
"empty" (all cores have a value of 1.0), a random core is selected and the 
latency matrix is used to find the cores "closest" to that random core; the 
histogram is then updated. If the histogram is "not empty", a greedy 
submodular subset selection algorithm selects N cores using the latency matrix 
and a "per-core" cost value. The "per-core" cost is a normalized version of 
the histogram divided by the number of processing units available on each 
core. Greedy submodular subset selection algorithms exploit a "diminishing 
returns" property to find an optimal subset of items under a knapsack 
constraint.

When the list of cores is returned, a bit vector representing a cpuset is bound 
to the container's pid_t. When the container is cleaned up, the histogram is 
updated by reducing the current task count on each core assigned to that pid_t 
by 1.0.

> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, particularly in the case of applications 
> requiring GPU resources. 
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology will require making choices about where containers are going to 
> run on the mesos-agent's processor(s) - this feature is a step toward 
> developing a robust solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5292) Website references non-existing file NewbieContributionOverview.jpg

2016-05-09 Thread Joerg Schad (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joerg Schad updated MESOS-5292:
---
Attachment: Screen Shot 2016-05-09 at 12.07.09.png

Broken rendered page.

> Website references non-existing file NewbieContributionOverview.jpg
> ---
>
> Key: MESOS-5292
> URL: https://issues.apache.org/jira/browse/MESOS-5292
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: Benjamin Bannier
>  Labels: site
> Attachments: Screen Shot 2016-05-09 at 12.07.09.png
>
>
> The website references the non-existing file 
> {{NewbieContributionOverview.jpg}} in {{docs/newbie-guide.md}}. Looking at 
> the commit adding this documentation it appears this file was never added to 
> the repository. We should either provide the file or rewrite the section to 
> work without needing to reference it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5342) CPU pinning/binding support for CgroupsCpushareIsolatorProcess

2016-05-09 Thread Chris (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5342?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris updated MESOS-5342:
-
Description: 
The cgroups isolator currently lacks support for binding (also called pinning) 
containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal 
core assignments for processes and threads. Poor assignments impact program 
performance, particularly in the case of applications requiring GPU resources. 

Most cluster management systems from the HPC community (SLURM) provide both 
cgroup isolation and cpu binding. This feature would provide similar 
capabilities. The current interest in supporting Intel's Cache Allocation 
Technology will require making choices about where containers are going to run 
on the mesos-agent's processor(s) - this feature is a step toward developing a 
robust solution.

The improvement in this JIRA ticket will handle hardware topology detection, 
track container-to-core utilization in a histogram, and use a mathematical 
optimization technique to select cores for container assignment based on 
latency and the container-to-core utilization histogram.

For GPU tasks, the improvement will prioritize selection of cores based on 
latency between the GPU and cores in an effort to minimize copy latency.

  was:
The cgroups isolator currently lacks support for binding (also called pinning) 
containers to a set of cores. The GNU/Linux kernel is known to make sub-optimal 
core assignments for processes and threads. Poor assignments impact program 
performance; particularly in the case of applications requiring GPU resources. 

Most cluster management systems from the HPC community (SLURM) provide both 
cgroup isolation and cpu binding. This feature would provide similar 
capabilities. The current interest in supporting Intel's Cache Allocation 
Technology will require making choices about where container's are going to run 
on the mesos-agent's processor(s) - this feature is a step toward developing a 
robust solution.

The improvement in this JIRA ticket will handle hardware topology detection, 
track container-to-core utilization in a histogram, and use a mathematical 
optimization technique to select cores for container assignment based on 
latency and the container-to-core utilization histogram.

For GPU tasks, the improvement will prioritize selection of cores based on 
latency between the GPU and cores in an effort to minimize copy latency.


> CPU pinning/binding support for CgroupsCpushareIsolatorProcess
> --
>
> Key: MESOS-5342
> URL: https://issues.apache.org/jira/browse/MESOS-5342
> Project: Mesos
>  Issue Type: Improvement
>  Components: cgroups, containerization
>Affects Versions: 0.28.1
>Reporter: Chris
>
> The cgroups isolator currently lacks support for binding (also called 
> pinning) containers to a set of cores. The GNU/Linux kernel is known to make 
> sub-optimal core assignments for processes and threads. Poor assignments 
> impact program performance, particularly in the case of applications 
> requiring GPU resources. 
> Most cluster management systems from the HPC community (SLURM) provide both 
> cgroup isolation and cpu binding. This feature would provide similar 
> capabilities. The current interest in supporting Intel's Cache Allocation 
> Technology will require making choices about where containers are going to 
> run on the mesos-agent's processor(s) - this feature is a step toward 
> developing a robust solution.
> The improvement in this JIRA ticket will handle hardware topology detection, 
> track container-to-core utilization in a histogram, and use a mathematical 
> optimization technique to select cores for container assignment based on 
> latency and the container-to-core utilization histogram.
> For GPU tasks, the improvement will prioritize selection of cores based on 
> latency between the GPU and cores in an effort to minimize copy latency.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-5292) Website references non-existing file NewbieContributionOverview.jpg

2016-05-09 Thread Joerg Schad (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-5292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276172#comment-15276172
 ] 

Joerg Schad commented on MESOS-5292:


I created https://reviews.apache.org/r/47115/ to remove the reference to the 
image for now (feel free to re-add it after the image has been added).
The review also fixes some other style issues flagged by the stricter website 
markdown renderer (e.g., broken links).
When making changes to the documentation, please check it with the website 
docker generator 
(https://github.com/apache/mesos/blob/master/support/site-docker/README.md).

> Website references non-existing file NewbieContributionOverview.jpg
> ---
>
> Key: MESOS-5292
> URL: https://issues.apache.org/jira/browse/MESOS-5292
> Project: Mesos
>  Issue Type: Bug
>  Components: project website
>Reporter: Benjamin Bannier
>  Labels: site
>
> The website references the non-existing file 
> {{NewbieContributionOverview.jpg}} in {{docs/newbie-guide.md}}. Looking at 
> the commit adding this documentation it appears this file was never added to 
> the repository. We should either provide the file or rewrite the section to 
> work without needing to reference it.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-5340) libevent builds may prevent new connections

2016-05-09 Thread Till Toenshoff (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-5340?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Till Toenshoff updated MESOS-5340:
--
Summary: libevent builds may prevent new connections  (was: SSL-downgrading 
support may prevent new connections)

> libevent builds may prevent new connections
> ---
>
> Key: MESOS-5340
> URL: https://issues.apache.org/jira/browse/MESOS-5340
> Project: Mesos
>  Issue Type: Bug
>  Components: security
>Affects Versions: 0.29.0, 0.28.1
>Reporter: Till Toenshoff
>Priority: Blocker
>  Labels: mesosphere, security, ssl
>
> When using an SSL-enabled build of Mesos in combination with SSL-downgrading 
> support, any connection that does not actually transmit data will hang the 
> runnable (e.g. master).
> For reproducing the issue (on any platform)...
> Spin up a master with enabled SSL-downgrading:
> {noformat}
> $ export SSL_ENABLED=true
> $ export SSL_SUPPORT_DOWNGRADE=true
> $ export SSL_KEY_FILE=/path/to/your/foo.key
> $ export SSL_CERT_FILE=/path/to/your/foo.crt
> $ export SSL_CA_FILE=/path/to/your/ca.crt
> $ ./bin/mesos-master.sh --work_dir=/tmp/foo
> {noformat}
> Create some artificial HTTP request load for quickly spotting the problem in 
> both, the master logs as well as the output of CURL itself:
> {noformat}
> $ while true; do sleep 0.1; echo $( date +">%H:%M:%S.%3N"; curl -s -k -A "SSL 
> Debug" http://localhost:5050/master/slaves; echo ;date +"<%H:%M:%S.%3N"; 
> echo); done
> {noformat}
> Now create a connection to the master that does not transmit any data:
> {noformat}
> $ telnet localhost 5050
> {noformat}
> You should now see the CURL requests hanging, the master stops responding to 
> new connections. This will persist until either some data is transmitted via 
> the above telnet connection or it is closed.
> This problem has initially been observed when running Mesos on an AWS cluster 
> with enabled internal ELB health-checks for the master node. Those 
> health-checks are using long-lasting connections that do not transmit any 
> data and are closed after a configurable duration. In our test environment, 
> this duration was set to 60 seconds and hence we were seeing our master 
> getting repetitively unresponsive for 60 seconds, then getting "unstuck" for 
> a brief period until it got stuck again.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3235) FetcherCacheHttpTest.HttpCachedSerialized and FetcherCacheHttpTest.HttpCachedConcurrent are flaky

2016-05-09 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276131#comment-15276131
 ] 

haosdent commented on MESOS-3235:
-

Got it. {{FetcherCacheHttpTest.HttpCachedConcurrent}} always finishes within 
1s on my machine, but it takes more time (more than 15 seconds) on my slow VM. 
According to the log, most of that time is spent in subprocess calls, e.g. 
launching {{mesos-fetcher}} and {{mesos-executor}}. I have not yet investigated 
why {{subprocess}} is so slow. Let me get more details about this and rework 
the patch. Thanks a lot for your comments!
 

> FetcherCacheHttpTest.HttpCachedSerialized and 
> FetcherCacheHttpTest.HttpCachedConcurrent are flaky
> -
>
> Key: MESOS-3235
> URL: https://issues.apache.org/jira/browse/MESOS-3235
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher, tests
>Affects Versions: 0.23.0
>Reporter: Joseph Wu
>Assignee: haosdent
>  Labels: mesosphere
> Fix For: 0.27.0
>
> Attachments: fetchercache_log_centos_6.txt
>
>
> On OSX, {{make clean && make -j8 V=0 check}}:
> {code}
> [--] 3 tests from FetcherCacheHttpTest
> [ RUN  ] FetcherCacheHttpTest.HttpCachedSerialized
> HTTP/1.1 200 OK
> Date: Fri, 07 Aug 2015 17:23:05 GMT
> Content-Length: 30
> I0807 10:23:05.673596 2085372672 exec.cpp:133] Version: 0.24.0
> E0807 10:23:05.675884 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:05.675897 182226944 exec.cpp:207] Executor registered on slave 
> 20150807-102305-139395082-52338-52313-S0
> E0807 10:23:05.683980 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Registered executor on 10.0.79.8
> Starting task 0
> Forked command at 54363
> sh -c './mesos-fetcher-test-cmd 0'
> E0807 10:23:05.694953 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Command exited with status 0 (pid: 54363)
> E0807 10:23:05.793927 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:06.590008 2085372672 exec.cpp:133] Version: 0.24.0
> E0807 10:23:06.592244 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:06.592243 353255424 exec.cpp:207] Executor registered on slave 
> 20150807-102305-139395082-52338-52313-S0
> E0807 10:23:06.597995 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Registered executor on 10.0.79.8
> Starting task 1
> Forked command at 54411
> sh -c './mesos-fetcher-test-cmd 1'
> E0807 10:23:06.608708 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Command exited with status 0 (pid: 54411)
> E0807 10:23:06.707649 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> ../../src/tests/fetcher_cache_tests.cpp:860: Failure
> Failed to wait 15secs for awaitFinished(task.get())
> *** Aborted at 1438968214 (unix time) try "date -d @1438968214" if you are 
> using GNU date ***
> [  FAILED  ] FetcherCacheHttpTest.HttpCachedSerialized (28685 ms)
> [ RUN  ] FetcherCacheHttpTest.HttpCachedConcurrent
> PC: @0x113723618 process::Owned<>::get()
> *** SIGSEGV (@0x0) received by PID 52313 (TID 0x118d59000) stack trace: ***
> @ 0x7fff8fcacf1a _sigtramp
> @ 0x7f9bc3109710 (unknown)
> @0x1136f07e2 mesos::internal::slave::Fetcher::fetch()
> @0x113862f9d 
> mesos::internal::slave::MesosContainerizerProcess::fetch()
> @0x1138f1b5d 
> _ZZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS2_11ContainerIDERKNS2_11CommandInfoERKNSt3__112basic_stringIcNSC_11char_traitsIcEENSC_9allocatorIcRK6OptionISI_ERKNS2_7SlaveIDES6_S9_SI_SM_SP_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSW_FSU_T1_T2_T3_T4_T5_ET6_T7_T8_T9_T10_ENKUlPNS_11ProcessBaseEE_clES1D_
> @0x1138f18cf 
> _ZNSt3__110__function6__funcIZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS5_11ContainerIDERKNS5_11CommandInfoERKNS_12basic_stringIcNS_11char_traitsIcEENS_9allocatorIcRK6OptionISK_ERKNS5_7SlaveIDES9_SC_SK_SO_SR_EENS2_6FutureIT_EERKNS2_3PIDIT0_EEMSY_FSW_T1_T2_T3_T4_T5_ET6_T7_T8_T9_T10_EUlPNS2_11ProcessBaseEE_NSI_IS1G_EEFvS1F_EEclEOS1F_
> @0x1143768cf std::__1::function<>::operator()()
> @0x11435ca7f process::ProcessBase::visit()
> @0x1143ed6fe process::DispatchEvent::visit()
> @0x11271 process::ProcessBase::serve()
> @0x114343b4e process::ProcessManager::resume()
> @0x1143431ca process::internal::schedule()
> @0x1143da646 _ZNSt3__114__thread_proxyINS_5tupleIJPFvvEEPvS5_
> @ 0x7fff95090268 

[jira] [Commented] (MESOS-3235) FetcherCacheHttpTest.HttpCachedSerialized and FetcherCacheHttpTest.HttpCachedConcurrent are flaky

2016-05-09 Thread Bernd Mathiske (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3235?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15276116#comment-15276116
 ] 

Bernd Mathiske commented on MESOS-3235:
---

It seems doubtful that lengthening the wait time for task completion solves 
much, since successful runs are way shorter than the default 15 seconds, 
typically in the low single digit second range. Machines aren't that slow, are 
they? And occasionally we also see these failures on machines that are known 
to be fast. I suspect something else is wrong here.

What I have seen in failure logs is that one task somehow has not produced 
status updates all the way up to the AWAIT statement in question - although it 
must have reached the contention barrier which asserts that all tasks have 
been launched, since the fetcher has been observed downloading every script. So one 
guess is that something is blocking/eating/delaying status updates at some 
stage - occasionally. In all the cases I have seen the tasks are not launched 
in serial order. And that's exactly why I wrote this test! So we can see if we 
are dealing with concurrency correctly. Too bad we don't know what's failing 
yet. 

If we had a way to reproduce this behavior more often, we could switch on more 
logging and just repeat the test often enough to find something. But repeating 
the test tends to make the problem go away.

Ideas?

> FetcherCacheHttpTest.HttpCachedSerialized and 
> FetcherCacheHttpTest.HttpCachedConcurrent are flaky
> -
>
> Key: MESOS-3235
> URL: https://issues.apache.org/jira/browse/MESOS-3235
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher, tests
>Affects Versions: 0.23.0
>Reporter: Joseph Wu
>Assignee: haosdent
>  Labels: mesosphere
> Fix For: 0.27.0
>
> Attachments: fetchercache_log_centos_6.txt
>
>
> On OSX, {{make clean && make -j8 V=0 check}}:
> {code}
> [--] 3 tests from FetcherCacheHttpTest
> [ RUN  ] FetcherCacheHttpTest.HttpCachedSerialized
> HTTP/1.1 200 OK
> Date: Fri, 07 Aug 2015 17:23:05 GMT
> Content-Length: 30
> I0807 10:23:05.673596 2085372672 exec.cpp:133] Version: 0.24.0
> E0807 10:23:05.675884 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:05.675897 182226944 exec.cpp:207] Executor registered on slave 
> 20150807-102305-139395082-52338-52313-S0
> E0807 10:23:05.683980 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Registered executor on 10.0.79.8
> Starting task 0
> Forked command at 54363
> sh -c './mesos-fetcher-test-cmd 0'
> E0807 10:23:05.694953 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Command exited with status 0 (pid: 54363)
> E0807 10:23:05.793927 184373248 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:06.590008 2085372672 exec.cpp:133] Version: 0.24.0
> E0807 10:23:06.592244 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> I0807 10:23:06.592243 353255424 exec.cpp:207] Executor registered on slave 
> 20150807-102305-139395082-52338-52313-S0
> E0807 10:23:06.597995 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Registered executor on 10.0.79.8
> Starting task 1
> Forked command at 54411
> sh -c './mesos-fetcher-test-cmd 1'
> E0807 10:23:06.608708 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> Command exited with status 0 (pid: 54411)
> E0807 10:23:06.707649 355938304 socket.hpp:173] Shutdown failed on fd=18: 
> Socket is not connected [57]
> ../../src/tests/fetcher_cache_tests.cpp:860: Failure
> Failed to wait 15secs for awaitFinished(task.get())
> *** Aborted at 1438968214 (unix time) try "date -d @1438968214" if you are 
> using GNU date ***
> [  FAILED  ] FetcherCacheHttpTest.HttpCachedSerialized (28685 ms)
> [ RUN  ] FetcherCacheHttpTest.HttpCachedConcurrent
> PC: @0x113723618 process::Owned<>::get()
> *** SIGSEGV (@0x0) received by PID 52313 (TID 0x118d59000) stack trace: ***
> @ 0x7fff8fcacf1a _sigtramp
> @ 0x7f9bc3109710 (unknown)
> @0x1136f07e2 mesos::internal::slave::Fetcher::fetch()
> @0x113862f9d 
> mesos::internal::slave::MesosContainerizerProcess::fetch()
> @0x1138f1b5d 
> _ZZN7process8dispatchI7NothingN5mesos8internal5slave25MesosContainerizerProcessERKNS2_11ContainerIDERKNS2_11CommandInfoERKNSt3__112basic_stringIcNSC_11char_traitsIcEENSC_9allocatorIcRK6OptionISI_ERKNS2_7SlaveIDES6_S9_SI_SM_SP_EENS_6FutureIT_EERKNS_3PIDIT0_EEMSW_FSU_T1_T2_T3_T4_T5_ET6_T7_T8_T9_T10_ENKUlPNS_11ProcessBaseEE_clES1D_
> @0x1138f18cf 
>