[jira] [Resolved] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.

2015-01-28 Thread Hou Xiaokun (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hou Xiaokun resolved MESOS-2284.

   Resolution: Fixed
Fix Version/s: 0.21.0

Hi, I changed the quorum to 1. The slave is displayed now!

Thanks!
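
For context: the registry quorum must be a strict majority of the masters, so
two masters with a quorum of 2 cannot tolerate the loss of either one, and a
quorum of 1 is only safe with a single master. A minimal sketch of the
conventional HA setup (three masters, quorum of 2; the addresses and paths are
illustrative, not from this ticket):

{noformat}
# Run on each of the three master hosts; --quorum=2 is a majority
# of 3 masters, so the cluster survives any single master failure.
mesos-master --zk=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos \
             --quorum=2 --work_dir=/var/lib/mesos --ip=$THIS_HOST_IP
{noformat}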

 Slave cannot be registered while masters keep switching to another one.
 ---

 Key: MESOS-2284
 URL: https://issues.apache.org/jira/browse/MESOS-2284
 Project: Mesos
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.20.1
 Environment: Ubuntu14.04
Reporter: Hou Xiaokun
Priority: Blocker
 Fix For: 0.21.0


 I followed the instructions on the page 
 http://mesosphere.com/docs/getting-started/datacenter/install/.
 I set up two masters and one slave with a quorum value of 2, and configured 
 the IP addresses in the hostname files separately.
 Here is the log from slave node,
 I0127 22:37:26.762953  1966 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:37:26.762985  1966 slave.cpp:638] Detecting new master
 I0127 22:37:26.763022  1966 status_update_manager.cpp:171] Pausing sending 
 status updates
 I0127 22:38:06.683840  1962 slave.cpp:3321] Current usage 16.98%. Max allowed 
 age: 5.111732713224155days
 I0127 22:38:26.986556  1966 slave.cpp:2623] master@10.27.17.135:5050 exited
 W0127 22:38:26.986675  1966 slave.cpp:2626] Master disconnected! Waiting for 
 a new master to be elected
 I0127 22:38:34.909605  1963 detector.cpp:138] Detected a new leader: 
 (id='2028')
 I0127 22:38:34.909811  1963 group.cpp:659] Trying to get 
 '/mesos/info_002028' in ZooKeeper
 I0127 22:38:34.910909  1963 detector.cpp:433] A new leading master 
 (UPID=master@10.27.16.214:5050) is detected
 I0127 22:38:34.910989  1963 slave.cpp:602] New master detected at 
 master@10.27.16.214:5050
 I0127 22:38:34.93  1963 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:38:34.911144  1963 slave.cpp:638] Detecting new master
 I0127 22:38:34.911183  1963 status_update_manager.cpp:171] Pausing sending 
 status updates
 I0127 22:39:06.684526  1964 slave.cpp:3321] Current usage 16.98%. Max allowed 
 age: 5.111731773610567days
 I0127 22:39:35.231653  1963 slave.cpp:2623] master@10.27.16.214:5050 exited
 W0127 22:39:35.231869  1963 slave.cpp:2626] Master disconnected! Waiting for 
 a new master to be elected
 I0127 22:39:42.761540  1964 detector.cpp:138] Detected a new leader: 
 (id='2029')
 I0127 22:39:42.761732  1964 group.cpp:659] Trying to get 
 '/mesos/info_002029' in ZooKeeper
 I0127 22:39:42.762914  1964 detector.cpp:433] A new leading master 
 (UPID=master@10.27.17.135:5050) is detected
 I0127 22:39:42.762984  1964 slave.cpp:602] New master detected at 
 master@10.27.17.135:5050
 I0127 22:39:42.763089  1964 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:39:42.763118  1964 slave.cpp:638] Detecting new master
 I0127 22:39:42.763155  1964 status_update_manager.cpp:171] Pausing sending 
 status updates





[jira] [Commented] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers

2015-01-28 Thread Dr. Stefan Schimanski (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14294907#comment-14294907
 ] 

Dr. Stefan Schimanski commented on MESOS-2276:
--

I have changed the topic of this issue. The original issue is resolved; what 
remains is that mesos-slave should behave much more forgivingly when there are 
many stopped containers. Moreover, a proper error message would help to 
identify the problem.
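
For the 'Too many open files' failure quoted below, the immediate workaround 
is to raise the slave's file descriptor limit. A hedged sketch, assuming the 
Mesosphere upstart job lives at /etc/init/mesos-slave.conf (the limit value is 
illustrative):

{noformat}
# One-off, in the shell that launches the slave:
ulimit -n 65536

# Or persistently, add an upstart limit stanza (soft and hard) to
# /etc/init/mesos-slave.conf:
limit nofile 65536 65536
{noformat}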

 Mesos-slave refuses to startup with many stopped docker containers
 --

 Key: MESOS-2276
 URL: https://issues.apache.org/jira/browse/MESOS-2276
 Project: Mesos
  Issue Type: Bug
  Components: docker, slave
Affects Versions: 0.21.0, 0.21.1
 Environment: Ubuntu 14.04LTS, Mesosphere packages
Reporter: Dr. Stefan Schimanski

 The mesos-slave is launched as
 # /usr/local/sbin/mesos-slave 
 --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2 
 --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint 
 --containerizers=docker --executor_registration_timeout=5mins 
 --logging_level=INFO
 giving this output:
 I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started!
 I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root
 I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0
 I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0
 I0127 19:26:32.674824 19880 main.cpp:151] Git SHA: 
 ab8fa655d34e8e15a4290422df38a18db1c09b5b
 I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave
 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client 
 environment:zookeeper.version=zookeeper C client 3.4.5
 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client 
 environment:host.name=srv002
 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client 
 environment:os.name=Linux
 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client 
 environment:os.arch=3.13.0-44-generic
 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client 
 environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client 
 environment:user.name=root
 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client 
 environment:user.home=/root
 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client 
 environment:user.dir=/root
 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786: 
 Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 
 sessionTimeout=1 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd=null 
 context=0x7fceec0009e0 flags=0
 I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051
 I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8; 
 mem(*):6960; disk(*):246731; ports(*):[31000-32000]
 I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002
 I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true
 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703: 
 initiated connection to server [10.0.0.1:2181]
 I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from 
 '/tmp/mesos/meta'
 I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status 
 update manager
 I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers
 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750: 
 session establishment complete on server [10.0.0.1:2181], 
 sessionId=0x14b2adf7a560106, negotiated timeout=1
 I0127 19:26:32.823292 19885 group.cpp:313] Group process 
 (group(1)@10.0.0.2:5051) connected to ZooKeeper
 I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue 
 size (joins, cancels, datas) = (0, 0, 0)
 I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in 
 ZooKeeper
 I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader: 
 (id='143')
 I0127 19:26:32.830559 19882 group.cpp:659] Trying to get 
 '/mesos/info_000143' in ZooKeeper
 I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master 
 (UPID=master@10.0.0.1:5050) is detected
 Failed to perform recovery: Collect failed: Failed to create pipe: Too many 
 open files
 To remedy this do as follows:
 Step 1: rm -f /tmp/mesos/meta/slaves/latest
 This ensures slave doesn't recover old live executors.
 Step 2: Restart the slave.
 There is nothing at /tmp/mesos/meta/slaves/latest.
 The slave was part of a 3 node cluster before.
 When started as an upstart service, the process is relaunched all the time 
 and a large number of defunct processes appear, like these ones:
 root 30321  0.0  0.0  13000   440 ?  S

[jira] [Updated] (MESOS-2276) Mesos-slave refuses to startup with many stopped docker containers

2015-01-28 Thread Dr. Stefan Schimanski (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2276?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dr. Stefan Schimanski updated MESOS-2276:
-
Summary: Mesos-slave refuses to startup with many stopped docker containers 
 (was: Mesos-slave with containerizer Docker doesn't startup anymore)

 Mesos-slave refuses to startup with many stopped docker containers
 --

 Key: MESOS-2276
 URL: https://issues.apache.org/jira/browse/MESOS-2276
 Project: Mesos
  Issue Type: Bug
  Components: docker, slave
Affects Versions: 0.21.0, 0.21.1
 Environment: Ubuntu 14.04LTS, Mesosphere packages
Reporter: Dr. Stefan Schimanski

 The mesos-slave is launched as
 # /usr/local/sbin/mesos-slave 
 --master=zk://10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181/mesos --ip=10.0.0.2 
 --log_dir=/var/log/mesos --attributes=node_id:srv002 --checkpoint 
 --containerizers=docker --executor_registration_timeout=5mins 
 --logging_level=INFO
 giving this output:
 I0127 19:26:32.674113 19880 logging.cpp:172] INFO level logging started!
 I0127 19:26:32.674741 19880 main.cpp:142] Build: 2014-11-22 05:29:57 by root
 I0127 19:26:32.674774 19880 main.cpp:144] Version: 0.21.0
 I0127 19:26:32.674799 19880 main.cpp:147] Git tag: 0.21.0
 I0127 19:26:32.674824 19880 main.cpp:151] Git SHA: 
 ab8fa655d34e8e15a4290422df38a18db1c09b5b
 I0127 19:26:32.786731 19880 main.cpp:165] Starting Mesos slave
 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@712: Client 
 environment:zookeeper.version=zookeeper C client 3.4.5
 2015-01-27 19:26:32,786:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@716: Client 
 environment:host.name=srv002
 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@723: Client 
 environment:os.name=Linux
 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@724: Client 
 environment:os.arch=3.13.0-44-generic
 2015-01-27 19:26:32,787:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@725: Client 
 environment:os.version=#73-Ubuntu SMP Tue Dec 16 00:22:43 UTC 2014
 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@733: Client 
 environment:user.name=root
 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@741: Client 
 environment:user.home=/root
 2015-01-27 19:26:32,788:19880(0x7fcf0cf9f700):ZOO_INFO@log_env@753: Client 
 environment:user.dir=/root
 2015-01-27 19:26:32,789:19880(0x7fcf0cf9f700):ZOO_INFO@zookeeper_init@786: 
 Initiating client connection, host=10.0.0.1:2181,10.0.0.2:2181,10.0.0.3:2181 
 sessionTimeout=1 watcher=0x7fcf13592a0a sessionId=0 sessionPasswd=null 
 context=0x7fceec0009e0 flags=0
 I0127 19:26:32.796588 19880 slave.cpp:169] Slave started on 1)@10.0.0.2:5051
 I0127 19:26:32.797345 19880 slave.cpp:289] Slave resources: cpus(*):8; 
 mem(*):6960; disk(*):246731; ports(*):[31000-32000]
 I0127 19:26:32.798017 19880 slave.cpp:318] Slave hostname: srv002
 I0127 19:26:32.798076 19880 slave.cpp:319] Slave checkpoint: true
 2015-01-27 19:26:32,800:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1703: 
 initiated connection to server [10.0.0.1:2181]
 I0127 19:26:32.808229 19886 state.cpp:33] Recovering state from 
 '/tmp/mesos/meta'
 I0127 19:26:32.809090 19882 status_update_manager.cpp:197] Recovering status 
 update manager
 I0127 19:26:32.809677 19887 docker.cpp:767] Recovering Docker containers
 2015-01-27 19:26:32,821:19880(0x7fcf08f5c700):ZOO_INFO@check_events@1750: 
 session establishment complete on server [10.0.0.1:2181], 
 sessionId=0x14b2adf7a560106, negotiated timeout=1
 I0127 19:26:32.823292 19885 group.cpp:313] Group process 
 (group(1)@10.0.0.2:5051) connected to ZooKeeper
 I0127 19:26:32.823443 19885 group.cpp:790] Syncing group operations: queue 
 size (joins, cancels, datas) = (0, 0, 0)
 I0127 19:26:32.823484 19885 group.cpp:385] Trying to create path '/mesos' in 
 ZooKeeper
 I0127 19:26:32.829711 19882 detector.cpp:138] Detected a new leader: 
 (id='143')
 I0127 19:26:32.830559 19882 group.cpp:659] Trying to get 
 '/mesos/info_000143' in ZooKeeper
 I0127 19:26:32.837913 19886 detector.cpp:433] A new leading master 
 (UPID=master@10.0.0.1:5050) is detected
 Failed to perform recovery: Collect failed: Failed to create pipe: Too many 
 open files
 To remedy this do as follows:
 Step 1: rm -f /tmp/mesos/meta/slaves/latest
 This ensures slave doesn't recover old live executors.
 Step 2: Restart the slave.
 There is nothing at /tmp/mesos/meta/slaves/latest.
 The slave was part of a 3 node cluster before.
 When started as an upstart service, the process is relaunched all the time 
 and a large number of defunct processes appear, like these ones:
 root 30321  0.0  0.0  13000   440 ?  S  19:28   0:00 iptables --wait -L -n
 root 30322  0.0  0.0      396 ?  S  19:28   0:00 sh -c docker inspect 

[jira] [Reopened] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.

2015-01-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov reopened MESOS-2284:


 Slave cannot be registered while masters keep switching to another one.
 ---

 Key: MESOS-2284
 URL: https://issues.apache.org/jira/browse/MESOS-2284
 Project: Mesos
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.20.1
 Environment: Ubuntu14.04
Reporter: Hou Xiaokun
Priority: Blocker
 Fix For: 0.21.0


 I followed the instructions on the page 
 http://mesosphere.com/docs/getting-started/datacenter/install/.
 I set up two masters and one slave with a quorum value of 2, and configured 
 the IP addresses in the hostname files separately.
 Here is the log from slave node,
 I0127 22:37:26.762953  1966 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:37:26.762985  1966 slave.cpp:638] Detecting new master
 I0127 22:37:26.763022  1966 status_update_manager.cpp:171] Pausing sending 
 status updates
 I0127 22:38:06.683840  1962 slave.cpp:3321] Current usage 16.98%. Max allowed 
 age: 5.111732713224155days
 I0127 22:38:26.986556  1966 slave.cpp:2623] master@10.27.17.135:5050 exited
 W0127 22:38:26.986675  1966 slave.cpp:2626] Master disconnected! Waiting for 
 a new master to be elected
 I0127 22:38:34.909605  1963 detector.cpp:138] Detected a new leader: 
 (id='2028')
 I0127 22:38:34.909811  1963 group.cpp:659] Trying to get 
 '/mesos/info_002028' in ZooKeeper
 I0127 22:38:34.910909  1963 detector.cpp:433] A new leading master 
 (UPID=master@10.27.16.214:5050) is detected
 I0127 22:38:34.910989  1963 slave.cpp:602] New master detected at 
 master@10.27.16.214:5050
 I0127 22:38:34.93  1963 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:38:34.911144  1963 slave.cpp:638] Detecting new master
 I0127 22:38:34.911183  1963 status_update_manager.cpp:171] Pausing sending 
 status updates
 I0127 22:39:06.684526  1964 slave.cpp:3321] Current usage 16.98%. Max allowed 
 age: 5.111731773610567days
 I0127 22:39:35.231653  1963 slave.cpp:2623] master@10.27.16.214:5050 exited
 W0127 22:39:35.231869  1963 slave.cpp:2626] Master disconnected! Waiting for 
 a new master to be elected
 I0127 22:39:42.761540  1964 detector.cpp:138] Detected a new leader: 
 (id='2029')
 I0127 22:39:42.761732  1964 group.cpp:659] Trying to get 
 '/mesos/info_002029' in ZooKeeper
 I0127 22:39:42.762914  1964 detector.cpp:433] A new leading master 
 (UPID=master@10.27.17.135:5050) is detected
 I0127 22:39:42.762984  1964 slave.cpp:602] New master detected at 
 master@10.27.17.135:5050
 I0127 22:39:42.763089  1964 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:39:42.763118  1964 slave.cpp:638] Detecting new master
 I0127 22:39:42.763155  1964 status_update_manager.cpp:171] Pausing sending 
 status updates





[jira] [Resolved] (MESOS-2284) Slave cannot be registered while masters keep switching to another one.

2015-01-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2284?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov resolved MESOS-2284.

Resolution: Not a Problem

 Slave cannot be registered while masters keep switching to another one.
 ---

 Key: MESOS-2284
 URL: https://issues.apache.org/jira/browse/MESOS-2284
 Project: Mesos
  Issue Type: Bug
  Components: documentation
Affects Versions: 0.20.1
 Environment: Ubuntu14.04
Reporter: Hou Xiaokun
Priority: Blocker
 Fix For: 0.21.0


 I followed the instructions on the page 
 http://mesosphere.com/docs/getting-started/datacenter/install/.
 I set up two masters and one slave with a quorum value of 2, and configured 
 the IP addresses in the hostname files separately.
 Here is the log from slave node,
 I0127 22:37:26.762953  1966 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:37:26.762985  1966 slave.cpp:638] Detecting new master
 I0127 22:37:26.763022  1966 status_update_manager.cpp:171] Pausing sending 
 status updates
 I0127 22:38:06.683840  1962 slave.cpp:3321] Current usage 16.98%. Max allowed 
 age: 5.111732713224155days
 I0127 22:38:26.986556  1966 slave.cpp:2623] master@10.27.17.135:5050 exited
 W0127 22:38:26.986675  1966 slave.cpp:2626] Master disconnected! Waiting for 
 a new master to be elected
 I0127 22:38:34.909605  1963 detector.cpp:138] Detected a new leader: 
 (id='2028')
 I0127 22:38:34.909811  1963 group.cpp:659] Trying to get 
 '/mesos/info_002028' in ZooKeeper
 I0127 22:38:34.910909  1963 detector.cpp:433] A new leading master 
 (UPID=master@10.27.16.214:5050) is detected
 I0127 22:38:34.910989  1963 slave.cpp:602] New master detected at 
 master@10.27.16.214:5050
 I0127 22:38:34.93  1963 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:38:34.911144  1963 slave.cpp:638] Detecting new master
 I0127 22:38:34.911183  1963 status_update_manager.cpp:171] Pausing sending 
 status updates
 I0127 22:39:06.684526  1964 slave.cpp:3321] Current usage 16.98%. Max allowed 
 age: 5.111731773610567days
 I0127 22:39:35.231653  1963 slave.cpp:2623] master@10.27.16.214:5050 exited
 W0127 22:39:35.231869  1963 slave.cpp:2626] Master disconnected! Waiting for 
 a new master to be elected
 I0127 22:39:42.761540  1964 detector.cpp:138] Detected a new leader: 
 (id='2029')
 I0127 22:39:42.761732  1964 group.cpp:659] Trying to get 
 '/mesos/info_002029' in ZooKeeper
 I0127 22:39:42.762914  1964 detector.cpp:433] A new leading master 
 (UPID=master@10.27.17.135:5050) is detected
 I0127 22:39:42.762984  1964 slave.cpp:602] New master detected at 
 master@10.27.17.135:5050
 I0127 22:39:42.763089  1964 slave.cpp:627] No credentials provided. 
 Attempting to register without authentication
 I0127 22:39:42.763118  1964 slave.cpp:638] Detecting new master
 I0127 22:39:42.763155  1964 status_update_manager.cpp:171] Pausing sending 
 status updates





[jira] [Commented] (MESOS-354) oversubscribe resources

2015-01-28 Thread Niklas Quarfot Nielsen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-354?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295839#comment-14295839
 ] 

Niklas Quarfot Nielsen commented on MESOS-354:
--

Oversubscription means many things and can be considered a subset of the 
currently ongoing effort on optimistic offers, where optimistic offers let the 
allocator offer resources:

 - To multiple frameworks, to increase 'parallelism' (as opposed to the 
conservative/pessimistic scheme) and **increase task throughput**.
 - As preemptable resources carved out of unallocated but reserved resources, 
to **limit reservation slack** (the difference between reserved and allocated 
resources).

A third (and equally important) case, which expands these scenarios, is 
oversubscription of _allocated_ resources, which limits the **usage slack** 
(the difference between allocated and used resources).
There has been a lot of recent research showing that usage slack can be 
reduced by 60% while maintaining the Service Level Objective (SLO) of latency 
critical workloads (1). However, this kind of oversubscription needs policies 
and fine-tuning to make sure that best-effort tasks don't interfere with 
latency critical ones. Therefore, we'd like to start a discussion on how such 
a system would look in Mesos. I will create a JIRA ticket (linking to this 
one) to start the conversation.

(1) 
http://static.googleusercontent.com/media/research.google.com/en/us/pubs/archive/43017.pdf
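
As a rough illustration of the slack definitions above, the master's 
/metrics/snapshot endpoint can be queried. A hedged sketch (the address, jq, 
and the exact metric names are assumptions; note that master/cpus_used counts 
*allocated* CPUs, so this shows allocation headroom, while true usage slack 
would need per-slave statistics such as /monitor/statistics.json):

{noformat}
# Unallocated CPU capacity (total - allocated) on the master:
curl -s http://10.0.0.1:5050/metrics/snapshot |
  jq '.["master/cpus_total"] - .["master/cpus_used"]'
{noformat}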

 oversubscribe resources
 ---

 Key: MESOS-354
 URL: https://issues.apache.org/jira/browse/MESOS-354
 Project: Mesos
  Issue Type: Story
  Components: isolation, master, slave
Reporter: brian wickman
Priority: Minor
 Attachments: mesos_virtual_offers.pdf


 This proposal is predicated upon offer revocation.
 The idea would be to add a new 'revoked' status either by (1) piggybacking 
 off an existing status update (TASK_LOST or TASK_KILLED) or (2) introducing a 
 new status update, TASK_REVOKED.
 In order to augment an offer with metadata about revocability, there are two 
 options:
   1) Add a revocable boolean to the Offer and
 a) offer only one type of Offer per slave at a particular time
 b) offer both revocable and non-revocable resources at the same time but 
 require frameworks to understand that Offers can contain overlapping resources
   2) Add a revocable_resources field on the Offer which is a superset of the 
 regular resources field.  By consuming  resources = revocable_resources in 
 a launchTask, the Task becomes a revocable task.  If launching a task with  
 resources, the Task is non-revocable.
 The use cases for revocable tasks are batch tasks (e.g. hadoop/pig/mapreduce) 
 and non-revocable tasks are online higher-SLA tasks (e.g. services.)
 Consider a non-revocable task that asks for 4 cores, 8 GB RAM and 20 GB of disk.  
 One of these resources is a rate (4 cpu seconds per second) and two of them 
 are fixed values (8GB and 20GB respectively, though disk resources can be 
 further broken down into spindles - fixed - and iops - a rate.)  In practice, 
 these are the maximum resources in the respective dimensions that this task 
 will use.  In reality, we provision tasks at some factor below peak, and only 
 hit peak resource consumption in rare circumstances or perhaps at a diurnal 
 peak.  
 In the meantime, we stand to gain from offering some constant factor of 
 the difference (reserved - actual) of non-revocable tasks as 
 revocable resources, depending upon our tolerance for revocable task churn.  
 The main challenge is coming up with an accurate short / medium / long-term 
 prediction of resource consumption based upon current behavior.
 In many cases it would be OK to be sloppy:
   * CPU / iops / network IO are rates (compressible) and can often be OK 
 below guarantees for brief periods of time while task revocation takes place
   * Memory slack can be provided by enabling swap and dynamically setting 
 swap paging boundaries.  Should swap ever be activated, that would be a 
 signal to revoke.
 The master / allocator would piggyback on the slave heartbeat mechanism to 
 learn of the amount of revocable resources available at any point in time.





[jira] [Commented] (MESOS-2232) Suppress MockAllocator::transformAllocation() warnings.

2015-01-28 Thread Benjamin Mahler (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296089#comment-14296089
 ] 

Benjamin Mahler commented on MESOS-2232:


First two are committed:
{noformat}
commit ccd697df0b7e05b07dee75d53e0ff55d6884ba2f
Author: Benjamin Mahler benjamin.mah...@gmail.com
Date:   Fri Jan 16 12:13:01 2015 -0800

Renamed MockAllocatorProcess to TestAllocatorProcess.

Review: https://reviews.apache.org/r/29989
{noformat}
{noformat}
commit b7bb6696b5a78dbc896b4756b7d4123e86c01635
Author: Benjamin Mahler benjamin.mah...@gmail.com
Date:   Fri Jan 16 14:10:05 2015 -0800

Updated TestAllocatorProcess to avoid the test warnings.

Review: https://reviews.apache.org/r/29990
{noformat}

 Suppress MockAllocator::transformAllocation() warnings.
 ---

 Key: MESOS-2232
 URL: https://issues.apache.org/jira/browse/MESOS-2232
 Project: Mesos
  Issue Type: Bug
  Components: test
Reporter: Alexander Rukletsov
Assignee: Benjamin Mahler
Priority: Minor

 After the 'transforming allocated resources' feature was added to the 
 allocator, a number of warnings pop up in the allocator tests. Commits 
 leading to this behaviour:
 {{dacc88292cc13d4b08fe8cda4df71110a96cb12a}}
 {{5a02d5bdc75d3b1149dcda519016374be06ec6bd}}
 corresponding reviews:
 https://reviews.apache.org/r/29083
 https://reviews.apache.org/r/29084
 Here is an example:
 {code}
 [ RUN ] MasterAllocatorTest/0.FrameworkReregistersFirst GMOCK WARNING: 
 Uninteresting mock function call - taking default action specified at: 
 ../../../src/tests/mesos.hpp:719: Function call: 
 transformAllocation(@0x7fd3bb5274d8 
 20150115-185632-1677764800-59671-44186-, @0x7fd3bb5274f8 
 20150115-185632-1677764800-59671-44186-S0, @0x1119140e0 16-byte object F0-5E 
 52-BB D3-7F 00-00 C0-5F 52-BB D3-7F 00-00) Stack trace: [ OK ] 
 MasterAllocatorTest/0.FrameworkReregistersFirst (204 ms)
 {code}





[jira] [Comment Edited] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container

2015-01-28 Thread Jay Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296106#comment-14296106
 ] 

Jay Buffington edited comment on MESOS-2183 at 1/29/15 12:10 AM:
-

Hey [~tnachen], I read your doc at 
https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit#
 and it's not clear you address the issue I encountered.  In my mesos-slave 
running in coreos I have it:

* running inside a pid namespace
* using the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports 
* it tries to determine the libprocess port based on that pid
* it doesn't see that pid since the pid docker inspect returns is only 
visible in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor 
failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue.  Can you give me a 
summary of how it fixes this problem I've described?


was (Author: jaybuff):
Hey [~tnachen], I read your doc at 
https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit#
 and it's not clear you address the issue I encountered.  In my mesos-slave 
running in coreos I have it:

* running inside a pid namespace
* useing the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports 
* it tries to determine the libprocess port based on that pid
* it does see that pid since the pid docker inspect returns is only visible 
in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor 
failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue.  Can you give me a 
summary of how it fixes this problem I've described?

 docker containerizer doesn't work when mesos-slave is running in a container
 

 Key: MESOS-2183
 URL: https://issues.apache.org/jira/browse/MESOS-2183
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Reporter: Jay Buffington
Assignee: Timothy Chen

 I've started running the mesos-slave process itself inside a docker 
 container.  I bind mount in the dockerd socket, so there is only one docker 
 daemon running on the system.
 The mesos-slave process uses docker run to start an executor in another, 
 sibling, container.  It asks docker inspect what the pid of the executor 
 running in the container is.  Since the mesos-slave process is in its own pid 
 namespace, it cannot see the pid for the executor in /proc.  Therefore, it 
 thinks the executor died and it does a docker kill.
 It looks like the executor pid is also used to determine what port the 
 executor is listening on.





[jira] [Commented] (MESOS-2144) Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread

2015-01-28 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296128#comment-14296128
 ] 

Cody Maloney commented on MESOS-2144:
-

Based on the addresses being at the low end of the address range, I'm guessing 
it is happening while running __cxa_exit (global static destruction) or some 
other system cleanup symbol, with glibc doing something on Mesos' behalf. 
Whatever that library is, it likely doesn't have symbols / is stripped if it 
is coming from the Linux distribution.

Side note: 
Backtraces from our code don't use the debugging info. But yes, debugging 
definitely looks enabled, functions shouldn't be optimized, and the binary 
isn't stripped of symbols, so stack traces should have all the function 
symbols.
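
For anyone digging into such a trace, a hedged sketch of resolving raw frame 
addresses from a glog stack trace (the paths and the address are placeholders):

{noformat}
# Map an address back to a function and line, provided the binary
# or library still carries symbols:
addr2line -C -f -e /usr/local/lib/libmesos.so 0x7fabcd123456

# Or open the core dump to see frames even inside system libraries:
gdb /path/to/test-binary core -ex 'bt' -ex 'quit'
{noformat}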

 Segmentation Fault in ExamplesTest.LowLevelSchedulerPthread
 ---

 Key: MESOS-2144
 URL: https://issues.apache.org/jira/browse/MESOS-2144
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Cody Maloney
Priority: Minor
  Labels: flaky

 Occurred on review bot review of: 
 https://reviews.apache.org/r/28262/#review62333
 The review doesn't touch code related to the test (and doesn't break 
 libprocess in general)
 [ RUN  ] ExamplesTest.LowLevelSchedulerPthread
 ../../src/tests/script.cpp:83: Failure
 Failed
 low_level_scheduler_pthread_test.sh terminated with signal Segmentation fault
 [  FAILED  ] ExamplesTest.LowLevelSchedulerPthread (7561 ms)
 The test 





[jira] [Commented] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container

2015-01-28 Thread Jay Buffington (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296106#comment-14296106
 ] 

Jay Buffington commented on MESOS-2183:
---

Hey [~tnachen], I read your doc at 
https://docs.google.com/document/d/1_1oLHXg_aHj_fYCzsjYwox9xvIYNAKIeVjO5BFxsUGI/edit#
 and it's not clear you address the issue I encountered.  In my mesos-slave 
running in coreos I have it:

* running inside a pid namespace
* useing the mounted /var/run/docker.sock to start a sibling container
* running docker inspect to get the pid it just launched
* it sees that the pid docker inspect reports 
* it tries to determine the libprocess port based on that pid
* it does see that pid since the pid docker inspect returns is only visible 
in the root namespace
* it does docker stop/kill because it incorrectly thinks the executor 
failed to start since it couldn't see the pid

I don't understand how your patch addresses that issue.  Can you give me a 
summary of how it fixes this problem I've described?
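
A minimal reproduction sketch of the visibility problem described above (the 
container name and image are examples, not from the ticket):

{noformat}
# From the host (root pid namespace) this works end to end:
docker run -d --name executor-demo busybox sleep 1000
PID=$(docker inspect --format '{{.State.Pid}}' executor-demo)
ls /proc/$PID   # present on the host...

# ...but inside the slave's container, which has its own pid
# namespace, /proc/$PID does not exist, so the slave concludes
# the executor died.
{noformat}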

 docker containerizer doesn't work when mesos-slave is running in a container
 

 Key: MESOS-2183
 URL: https://issues.apache.org/jira/browse/MESOS-2183
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Reporter: Jay Buffington
Assignee: Timothy Chen

 I've started running the mesos-slave process itself inside a docker 
 container.  I bind mount in the dockerd socket, so there is only one docker 
 daemon running on the system.
 The mesos-slave process uses docker run to start an executor in another, 
 sibling, container.  It asks docker inspect what the pid of the executor 
 running in the container is.  Since the mesos-slave process is in its own pid 
 namespace, it cannot see the pid for the executor in /proc.  Therefore, it 
 thinks the executor died and it does a docker kill.
 It looks like the executor pid is also used to determine what port the 
 executor is listening on.





[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295983#comment-14295983
 ] 

Steven Schlansker commented on MESOS-2162:
--

I would love to help out in any way I can, but I am not much of a C++ guy. But 
at the very least I would happily test it, or if you have other suggestions for 
how I can help...

 Consider a C++ implementation of CoreOS AppContainer spec
 -

 Key: MESOS-2162
 URL: https://issues.apache.org/jira/browse/MESOS-2162
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Dominic Hamon
  Labels: mesosphere, twitter

 CoreOS have released a 
 [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
  for a container abstraction as an alternative to Docker. They have also 
 released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
 We should consider a C++ implementation of the specification to have parity 
 with the community and then use this implementation for our containerizer.





[jira] [Created] (MESOS-2289) Design doc for the HTTP API

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2289:
-

 Summary: Design doc for the HTTP API
 Key: MESOS-2289
 URL: https://issues.apache.org/jira/browse/MESOS-2289
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


This tracks the design of the HTTP API.





[jira] [Updated] (MESOS-2288) HTTP API for interacting with Mesos

2015-01-28 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2288?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-2288:
--
Epic Name: HTTP API  (was: http api)

 HTTP API for interacting with Mesos
 ---

 Key: MESOS-2288
 URL: https://issues.apache.org/jira/browse/MESOS-2288
 Project: Mesos
  Issue Type: Epic
Reporter: Vinod Kone

 Currently Mesos frameworks (schedulers and executors) interact with Mesos 
 (masters and slaves) via drivers provided by Mesos. While the driver helped 
 in providing some common functionality for all frameworks (master detection, 
 authentication, validation etc), it has several drawbacks.
 -- Frameworks need to depend on a native library, which makes their 
 build/deploy process cumbersome.
 -- Pure-language frameworks cannot use off-the-shelf libraries to interact 
 with the undocumented API used by the driver.
 -- It makes it hard for developers to implement new APIs (lots of boilerplate 
 code to write).
 This proposal is for Mesos to provide a well documented public HTTP API that 
 frameworks (and maybe operators) can use to interact with Mesos.





[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295944#comment-14295944
 ] 

Steven Schlansker commented on MESOS-2162:
--

This library may be a good starting point: https://github.com/cdaylward/libappc/

 Consider a C++ implementation of CoreOS AppContainer spec
 -

 Key: MESOS-2162
 URL: https://issues.apache.org/jira/browse/MESOS-2162
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Dominic Hamon
  Labels: mesosphere, twitter

 CoreOS have released a 
 [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
  for a container abstraction as an alternative to Docker. They have also 
 released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
 We should consider a C++ implementation of the specification to have parity 
 with the community and then use this implementation for our containerizer.





[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295978#comment-14295978
 ] 

Timothy Chen commented on MESOS-2162:
-

Hi Steven, that's what I think too.

It's my plan to work on this but this quarter I won't have much time to do so.

Are you interested in this? We could work together.

 Consider a C++ implementation of CoreOS AppContainer spec
 -

 Key: MESOS-2162
 URL: https://issues.apache.org/jira/browse/MESOS-2162
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Dominic Hamon
  Labels: mesosphere, twitter

 CoreOS have released a 
 [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
  for a container abstraction as an alternative to Docker. They have also 
 released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
 We should consider a C++ implementation of the specification to have parity 
 with the community and then use this implementation for our containerizer.





[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Ian Downes (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295996#comment-14295996
 ] 

Ian Downes commented on MESOS-2162:
---

I'll be working on this too, development and/or shepherding. 

 Consider a C++ implementation of CoreOS AppContainer spec
 -

 Key: MESOS-2162
 URL: https://issues.apache.org/jira/browse/MESOS-2162
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Dominic Hamon
  Labels: mesosphere, twitter

 CoreOS have released a 
 [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
  for a container abstraction as an alternative to Docker. They have also 
 released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
 We should consider a C++ implementation of the specification to have parity 
 with the community and then use this implementation for our containerizer.





[jira] [Created] (MESOS-2291) Move executor driver validations to slave

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2291:
-

 Summary: Move executor driver validations to slave
 Key: MESOS-2291
 URL: https://issues.apache.org/jira/browse/MESOS-2291
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


With the HTTP API, the executor driver will no longer exist and hence all the 
validations should move to the slave. 






[jira] [Created] (MESOS-2297) Add authentication support for HTTP API

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2297:
-

 Summary: Add authentication support for HTTP API
 Key: MESOS-2297
 URL: https://issues.apache.org/jira/browse/MESOS-2297
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


To start with, we will only support basic HTTP auth.
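
A sketch of what basic HTTP auth looks like from a client; the endpoint path 
and credentials are placeholders, since the API itself had not been designed 
yet:

{noformat}
# curl sends an Authorization: Basic header for the -u credentials:
curl -u principal:secret http://10.0.0.1:5050/placeholder/endpoint
{noformat}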





[jira] [Created] (MESOS-2295) Implement the Call endpoint on Slave

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2295:
-

 Summary: Implement the Call endpoint on Slave
 Key: MESOS-2295
 URL: https://issues.apache.org/jira/browse/MESOS-2295
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone








[jira] [Created] (MESOS-2298) Provide master detection library/libraries for pure schedulers

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2298:
-

 Summary: Provide master detection library/libraries for pure 
schedulers
 Key: MESOS-2298
 URL: https://issues.apache.org/jira/browse/MESOS-2298
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


When schedulers start interacting with the Mesos master via HTTP endpoints, 
they need a way to detect masters. Ideally, Mesos provides master detection 
libraries in supported languages (Java and Python to start with) to make this 
easy for frameworks.
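
A hedged sketch of what such a detection library has to do, shown with the 
stock ZooKeeper CLI (the znode name follows the log lines quoted earlier in 
this digest; the leading master is the info_* znode with the smallest sequence 
number):

{noformat}
zkCli.sh -server 10.0.0.1:2181 ls /mesos
# pick the info_* entry with the smallest sequence number, then:
zkCli.sh -server 10.0.0.1:2181 get /mesos/info_000143
{noformat}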





[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295939#comment-14295939
 ] 

Steven Schlansker commented on MESOS-2162:
--

Any possibility of getting this scheduled for an upcoming release?

 Consider a C++ implementation of CoreOS AppContainer spec
 -

 Key: MESOS-2162
 URL: https://issues.apache.org/jira/browse/MESOS-2162
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Dominic Hamon
  Labels: mesosphere, twitter

 CoreOS have released a 
 [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
  for a container abstraction as an alternative to Docker. They have also 
 released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
 We should consider a C++ implementation of the specification to have parity 
 with the community and then use this implementation for our containerizer.





[jira] [Commented] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14295981#comment-14295981
 ] 

Steven Schlansker commented on MESOS-2162:
--

I would love to help out in any way I can, but I am not much of a C++ guy.  But 
at the very least I would happily test it, or if you have other suggestions for 
how I can help...

 Consider a C++ implementation of CoreOS AppContainer spec
 -

 Key: MESOS-2162
 URL: https://issues.apache.org/jira/browse/MESOS-2162
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Dominic Hamon
  Labels: mesosphere, twitter

 CoreOS have released a 
 [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
  for a container abstraction as an alternative to Docker. They have also 
 released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
 We should consider a C++ implementation of the specification to have parity 
 with the community and then use this implementation for our containerizer.





[jira] [Issue Comment Deleted] (MESOS-2162) Consider a C++ implementation of CoreOS AppContainer spec

2015-01-28 Thread Steven Schlansker (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2162?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steven Schlansker updated MESOS-2162:
-
Comment: was deleted

(was: I would love to help out in any way I can, but I am not much of a C++ 
guy.  But at the very least I would happily test it, or if you have other 
suggestions for how I can help...)

 Consider a C++ implementation of CoreOS AppContainer spec
 -

 Key: MESOS-2162
 URL: https://issues.apache.org/jira/browse/MESOS-2162
 Project: Mesos
  Issue Type: Story
  Components: containerization
Reporter: Dominic Hamon
  Labels: mesosphere, twitter

 CoreOS have released a 
 [specification|https://github.com/coreos/rocket/blob/master/app-container/SPEC.md]
  for a container abstraction as an alternative to Docker. They have also 
 released a reference implementation, [rocket|https://coreos.com/blog/rocket/].
 We should consider a C++ implementation of the specification to have parity 
 with the community and then use this implementation for our containerizer.





[jira] [Updated] (MESOS-1127) Expose lower-level scheduler/executor API

2015-01-28 Thread Vinod Kone (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1127:
--
 Epic Name:   (was: HTTP API)
Issue Type: Task  (was: Epic)

 Expose lower-level scheduler/executor API
 -

 Key: MESOS-1127
 URL: https://issues.apache.org/jira/browse/MESOS-1127
 Project: Mesos
  Issue Type: Task
  Components: framework
Reporter: Benjamin Hindman
Assignee: Benjamin Hindman
  Labels: twitter

 The default scheduler/executor interface and implementation in Mesos have a 
 few drawbacks:
 (1) The interface is fairly high-level which makes it hard to do certain 
 things, for example, handle events (callbacks) in batch. This can have a big 
 impact on the performance of schedulers (for example, writing task updates 
 that need to be persisted).
 (2) The implementation requires writing a lot of boilerplate JNI and native 
 Python wrappers when adding additional API components.
 The plan is to provide a lower-level API that can easily be used to implement 
 the higher-level API that is currently provided. This will also open the door 
 to more easily building native-language Mesos libraries (i.e., not needing 
 the C++ shim layer) and building new higher-level abstractions on top of the 
 lower-level API.





[jira] [Created] (MESOS-2293) Implement the Call endpoint on master

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2293:
-

 Summary: Implement the Call endpoint on master
 Key: MESOS-2293
 URL: https://issues.apache.org/jira/browse/MESOS-2293
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone








[jira] [Created] (MESOS-2292) Implement Call/Event protobufs for Executor

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2292:
-

 Summary: Implement Call/Event protobufs for Executor
 Key: MESOS-2292
 URL: https://issues.apache.org/jira/browse/MESOS-2292
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone








[jira] [Created] (MESOS-2294) Implement the Events endpoint on master

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2294:
-

 Summary: Implement the Events endpoint on master
 Key: MESOS-2294
 URL: https://issues.apache.org/jira/browse/MESOS-2294
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone








[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.

2015-01-28 Thread Steve Niemitz (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Steve Niemitz updated MESOS-2215:
-
Description: 
Once the slave restarts and recovers the task, I see this error in the log for 
all tasks that were recovered every second or so.  Note, these were NOT docker 
tasks:

W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for  
container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
 of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with 
status 1 stderr = Error: No such image or container: 
mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
However the tasks themselves are still healthy and running.

The slave was launched with --containerizers=mesos,docker

-
More info: it looks like the docker containerizer is a little too ambitious 
about recovering containers, again this was not a docker task:
I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
'7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
 of framework 20150109-161713-715350282-5050-290797-

Looking into the source, it looks like the problem is that the 
ComposingContainerizer runs recover in parallel, but neither the docker 
containerizer nor the mesos containerizer checks whether it should recover the 
task (i.e., whether it was the one that launched it).  Perhaps this needs to 
be written into the checkpoint somewhere?

  was:
Once the slave restarts and recovers the task, I see this error in the log for 
all tasks that were recovered every second or so.  Note, these were NOT docker 
tasks:

W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for  
container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
 of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with 
status 1 stderr = Error: No such image or container: 
mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
However the tasks themselves are still healthy and running.

The slave was launched with --containerizers=mesos,docker

-
More info: it looks like the docker containerizer is a little too ambitious 
about recovering containers, again this was not a docker task:
I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
'7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
 of framework 20150109-161713-715350282-5050-290797-

Looking into the source, it looks like the problem is that the 
ComposingContainerize runs recover in parallel, but neither the docker 
containerizer not mesos containerizer check if they should recover the task or 
not (because they were the ones that launched it).  Perhaps this needs to be 
written into the checkpoint somewhere?


 The Docker containerizer attempts to recover any task when checkpointing is 
 enabled, not just docker tasks.
 ---

 Key: MESOS-2215
 URL: https://issues.apache.org/jira/browse/MESOS-2215
 Project: Mesos
  Issue Type: Bug
  Components: docker
Affects Versions: 0.21.0
Reporter: Steve Niemitz
Assignee: Timothy Chen

 Once the slave restarts and recovers the task, I see this error in the log 
 for all tasks that were recovered every second or so.  Note, these were NOT 
 docker tasks:
 W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage 
 for  container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor 
 thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd
  of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker 
 inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited 
 with status 1 stderr = Error: No such image or container: 
 mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21
 However the tasks themselves are still healthy and running.
 The slave was launched with --containerizers=mesos,docker
 -
 More info: it looks like the docker containerizer is a little too ambitious 
 about recovering containers, again this was not a docker task:
 I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container 
 '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 
 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd'
  of framework 20150109-161713-715350282-5050-290797-
 Looking into the source, it looks like the problem is 

[jira] [Created] (MESOS-2290) Move all scheduler driver validations to master

2015-01-28 Thread Vinod Kone (JIRA)
Vinod Kone created MESOS-2290:
-

 Summary: Move all scheduler driver validations to master
 Key: MESOS-2290
 URL: https://issues.apache.org/jira/browse/MESOS-2290
 Project: Mesos
  Issue Type: Task
Reporter: Vinod Kone


With the HTTP API, the scheduler driver will no longer exist and hence all the 
validations should move to the master.





[jira] [Commented] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper

2015-01-28 Thread Cody Maloney (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296345#comment-14296345
 ] 

Cody Maloney commented on MESOS-1806:
-

https://reviews.apache.org/r/30194/
https://reviews.apache.org/r/30195/
https://reviews.apache.org/r/30393/
https://reviews.apache.org/r/30394/
https://reviews.apache.org/r/30395/
https://reviews.apache.org/r/30396/
https://reviews.apache.org/r/30397/
https://reviews.apache.org/r/30398/


 Substituting etcd or ReplicatedLog for Zookeeper
 

 Key: MESOS-1806
 URL: https://issues.apache.org/jira/browse/MESOS-1806
 Project: Mesos
  Issue Type: Task
Reporter: Ed Ropple
Assignee: Cody Maloney
Priority: Minor

 adam_mesos   eropple: Could you also file a new JIRA for Mesos to drop ZK 
 in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
 that one.
 --
 Consider it filed. =)





[jira] [Updated] (MESOS-1825) Support the webui over HTTPS.

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-1825:
---
Summary: Support the webui over HTTPS.  (was: support https link)

 Support the webui over HTTPS.
 -

 Key: MESOS-1825
 URL: https://issues.apache.org/jira/browse/MESOS-1825
 Project: Mesos
  Issue Type: Bug
  Components: webui
Reporter: Kien Pham
Priority: Minor
  Labels: newbie

 Right now in the Mesos UI, links are hardcoded to http://. They should not be 
 hardcoded, so that https links can be supported.
 Ex:
 https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L17





[jira] [Commented] (MESOS-1825) Support the webui over HTTPS.

2015-01-28 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-1825?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296276#comment-14296276
 ] 

ASF GitHub Bot commented on MESOS-1825:
---

Github user bmahler commented on the pull request:

https://github.com/apache/mesos/pull/34#issuecomment-71957729
  
Thanks Arnaud! Nice, there will be built-in HTTPS support in Mesos at some 
point, you may want to chime in here:

https://issues.apache.org/jira/browse/MESOS-1825


 Support the webui over HTTPS.
 -

 Key: MESOS-1825
 URL: https://issues.apache.org/jira/browse/MESOS-1825
 Project: Mesos
  Issue Type: Bug
  Components: webui
Reporter: Kien Pham
Priority: Minor
  Labels: newbie

 Right now in the Mesos UI, links are hardcoded to http://. They should not be 
 hardcoded, so that https links can be supported.
 Ex:
 https://github.com/apache/mesos/blob/master/src/webui/master/static/js/controllers.js#L17





[jira] [Commented] (MESOS-2183) docker containerizer doesn't work when mesos-slave is running in a container

2015-01-28 Thread Timothy Chen (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14296300#comment-14296300
 ] 

Timothy Chen commented on MESOS-2183:
-

So I'm planning to leverage the --pid=host flag in docker 1.5, which won't 
clone a new pid namespace. With this you won't see the problems you are seeing.

What I described in my doc is to handle recovery.
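
A sketch of that workaround (requires docker >= 1.5; the image name and the 
trailing flags are illustrative):

{noformat}
# Run the slave container in the host pid namespace so the executor
# pids it reads via docker inspect are visible under its /proc:
docker run --pid=host \
    -v /var/run/docker.sock:/var/run/docker.sock \
    example/mesos-slave --containerizers=docker
{noformat}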


 docker containerizer doesn't work when mesos-slave is running in a container
 

 Key: MESOS-2183
 URL: https://issues.apache.org/jira/browse/MESOS-2183
 Project: Mesos
  Issue Type: Bug
  Components: containerization, docker
Reporter: Jay Buffington
Assignee: Timothy Chen

 I've started running the mesos-slave process itself inside a docker 
 container.  I bind-mount the dockerd socket, so there is only one docker 
 daemon running on the system.
 The mesos-slave process uses docker run to start an executor in another, 
 sibling container.  It asks docker inspect for the pid of the executor 
 running in that container.  Since the mesos-slave process is in its own pid 
 namespace, it cannot see the executor's pid in /proc.  Therefore, it 
 thinks the executor died and does a docker kill.
 It looks like the executor pid is also used to determine what port the 
 executor is listening on.
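
To make the failure mode concrete, here is a minimal sketch (assumed names, 
not the actual Mesos code) of the kind of /proc-based liveness check that 
breaks across pid namespaces: the pid reported by docker inspect is valid on 
the host, but /proc inside the slave's container has no entry for it, so the 
check wrongly concludes the executor is gone.

{code}
// Sketch only: why a /proc-based liveness check fails when the checking
// process lives in a different pid namespace than the target pid.
#include <sys/types.h>
#include <sys/stat.h>

#include <iostream>
#include <string>

// Returns true if `pid` is visible in this process's pid namespace.
bool isAlive(pid_t pid)
{
  struct stat s;
  // In a containerized slave, the host pid from `docker inspect` has no
  // /proc/<pid> entry here, so this returns false for a live executor.
  return ::stat(("/proc/" + std::to_string(pid)).c_str(), &s) == 0;
}

int main()
{
  pid_t executorPid = 12345;  // hypothetical pid from `docker inspect`

  if (!isAlive(executorPid)) {
    std::cout << "executor presumed dead; slave would `docker kill` it\n";
  }
  return 0;
}
{code}

Running the slave with --pid=host sidesteps this, because the slave then 
shares the host pid namespace and the /proc lookup succeeds.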



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Assigned] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler reassigned MESOS-2228:
--

Assignee: Benjamin Mahler  (was: Alexander Rukletsov)

{quote}
(or is not being reaped)
{quote}

We're not seeing 'Terminated' in the output, which means that it's the SIGKILL 
reaching the pid, not the SIGTERM, no? Because of this, it doesn't seem like a 
reaping issue; anything I'm missing?

{quote}
From the logs it looks like a simple sleep task doesn't terminate
{quote}

Looks like this to me as well; these are VMs, and we sometimes see strange 
blocking behavior. I've bumped the timeout for now and included a nicer error 
message. Please take a look:

https://reviews.apache.org/r/30402/
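
For context, the behavior under test is SIGTERM-then-SIGKILL escalation. A 
minimal sketch of that pattern (illustrative names only, not the Mesos 
executor code):

{code}
// Sketch of signal escalation: ask politely with SIGTERM, wait up to a
// grace period, then force with SIGKILL.
#include <sys/types.h>
#include <sys/wait.h>

#include <csignal>
#include <unistd.h>

// Returns true if the child exited within the grace period after SIGTERM,
// false if we had to escalate to SIGKILL.
bool gracefulShutdown(pid_t child, int gracePeriodSecs)
{
  ::kill(child, SIGTERM);  // a well-behaved task exits on this

  for (int i = 0; i < gracePeriodSecs * 10; ++i) {
    int status;
    if (::waitpid(child, &status, WNOHANG) == child) {
      return true;  // reaped within the grace period
    }
    ::usleep(100 * 1000);  // poll every 100ms
  }

  ::kill(child, SIGKILL);  // grace period expired: escalate
  ::waitpid(child, nullptr, 0);
  return false;
}
{code}

If the task blocks (these are VMs, after all), SIGTERM delivery and scheduling 
can lag, the grace period expires, and the test observes the SIGKILL path 
instead of 'Terminated'; bumping the timeout gives slow machines room for the 
graceful path.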

 SlaveTest.MesosExecutorGracefulShutdown is flaky
 

 Key: MESOS-2228
 URL: https://issues.apache.org/jira/browse/MESOS-2228
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.22.0
Reporter: Vinod Kone
Assignee: Benjamin Mahler
  Labels: twitter

 Observed this on internal CI
 {noformat}
 [ RUN  ] SlaveTest.MesosExecutorGracefulShutdown
 Using temporary directory 
 '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ'
 I0124 08:14:04.399211  7926 leveldb.cpp:176] Opened db in 27.364056ms
 I0124 08:14:04.402632  7926 leveldb.cpp:183] Compacted db in 3.357646ms
 I0124 08:14:04.402691  7926 leveldb.cpp:198] Created db iterator in 23822ns
 I0124 08:14:04.402708  7926 leveldb.cpp:204] Seeked to beginning of db in 
 1913ns
 I0124 08:14:04.402716  7926 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 458ns
 I0124 08:14:04.402767  7926 replica.cpp:744] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0124 08:14:04.403728  7951 recover.cpp:449] Starting replica recovery
 I0124 08:14:04.404011  7951 recover.cpp:475] Replica is in EMPTY status
 I0124 08:14:04.407765  7950 replica.cpp:641] Replica in EMPTY status received 
 a broadcasted recover request
 I0124 08:14:04.408710  7951 recover.cpp:195] Received a recover response from 
 a replica in EMPTY status
 I0124 08:14:04.419666  7951 recover.cpp:566] Updating replica status to 
 STARTING
 I0124 08:14:04.429719  7953 master.cpp:262] Master 
 20150124-081404-16842879-47787-7926 (utopic) started on 127.0.1.1:47787
 I0124 08:14:04.429790  7953 master.cpp:308] Master only allowing 
 authenticated frameworks to register
 I0124 08:14:04.429802  7953 master.cpp:313] Master only allowing 
 authenticated slaves to register
 I0124 08:14:04.429826  7953 credentials.hpp:36] Loading credentials for 
 authentication from 
 '/tmp/SlaveTest_MesosExecutorGracefulShutdown_AWdtVJ/credentials'
 I0124 08:14:04.430277  7953 master.cpp:357] Authorization enabled
 I0124 08:14:04.432682  7953 master.cpp:1219] The newly elected leader is 
 master@127.0.1.1:47787 with id 20150124-081404-16842879-47787-7926
 I0124 08:14:04.432816  7953 master.cpp:1232] Elected as the leading master!
 I0124 08:14:04.432894  7953 master.cpp:1050] Recovering from registrar
 I0124 08:14:04.433212  7950 registrar.cpp:313] Recovering registrar
 I0124 08:14:04.434226  7951 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 14.323302ms
 I0124 08:14:04.434270  7951 replica.cpp:323] Persisted replica status to 
 STARTING
 I0124 08:14:04.434489  7951 recover.cpp:475] Replica is in STARTING status
 I0124 08:14:04.436164  7951 replica.cpp:641] Replica in STARTING status 
 received a broadcasted recover request
 I0124 08:14:04.439368  7947 recover.cpp:195] Received a recover response from 
 a replica in STARTING status
 I0124 08:14:04.440626  7947 recover.cpp:566] Updating replica status to VOTING
 I0124 08:14:04.443667  7947 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 2.698664ms
 I0124 08:14:04.443759  7947 replica.cpp:323] Persisted replica status to 
 VOTING
 I0124 08:14:04.443925  7947 recover.cpp:580] Successfully joined the Paxos 
 group
 I0124 08:14:04.444160  7947 recover.cpp:464] Recover process terminated
 I0124 08:14:04.444543  7949 log.cpp:660] Attempting to start the writer
 I0124 08:14:04.446331  7949 replica.cpp:477] Replica received implicit 
 promise request with proposal 1
 I0124 08:14:04.449329  7949 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 2.690453ms
 I0124 08:14:04.449388  7949 replica.cpp:345] Persisted promised to 1
 I0124 08:14:04.450637  7947 coordinator.cpp:230] Coordinator attemping to 
 fill missing position
 I0124 08:14:04.452271  7949 replica.cpp:378] Replica received explicit 
 promise request for position 0 with proposal 2
 I0124 08:14:04.455124  7949 leveldb.cpp:343] Persisting action (8 bytes) to 
 leveldb took 2.593522ms
 I0124 08:14:04.455157  7949 replica.cpp:679] Persisted action at 0
 I0124 08:14:04.456594  

[jira] [Updated] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2228:
---
Labels: twitter  (was: )

 SlaveTest.MesosExecutorGracefulShutdown is flaky
 

 Key: MESOS-2228
 URL: https://issues.apache.org/jira/browse/MESOS-2228
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.22.0
Reporter: Vinod Kone
Assignee: Benjamin Mahler
  Labels: twitter


[jira] [Updated] (MESOS-2228) SlaveTest.MesosExecutorGracefulShutdown is flaky

2015-01-28 Thread Benjamin Mahler (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Benjamin Mahler updated MESOS-2228:
---
Sprint: Twitter Mesos Q1 Sprint 1

 SlaveTest.MesosExecutorGracefulShutdown is flaky
 

 Key: MESOS-2228
 URL: https://issues.apache.org/jira/browse/MESOS-2228
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.22.0
Reporter: Vinod Kone
Assignee: Benjamin Mahler
  Labels: twitter


[jira] [Created] (MESOS-2286) Simplify the allocator architecture

2015-01-28 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-2286:
--

 Summary: Simplify the allocator architecture
 Key: MESOS-2286
 URL: https://issues.apache.org/jira/browse/MESOS-2286
 Project: Mesos
  Issue Type: Improvement
Reporter: Alexander Rukletsov






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2285) Eliminate dependency on master::Flags in Allocator

2015-01-28 Thread Alexander Rukletsov (JIRA)
Alexander Rukletsov created MESOS-2285:
--

 Summary: Eliminate dependency on master::Flags in Allocator
 Key: MESOS-2285
 URL: https://issues.apache.org/jira/browse/MESOS-2285
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Alexander Rukletsov
Priority: Minor


{{Allocator}} extracts parameters from {{master::Flags}} during initialization. 
Currently, only the {{allocation_interval}} key from {{master::Flags}} is used. 
It makes sense to introduce a separate structure {{allocator::Options}} with 
the values relevant for allocation and eliminate the dependency on 
{{master::Flags}}.
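
A minimal sketch of what such a structure could look like (the name 
{{allocator::Options}} comes from the description above; the field name and 
default are assumptions, not a settled API):

{code}
// Hypothetical allocator::Options: the allocator receives only the values
// it needs, instead of depending on the whole master::Flags.
#include <stout/duration.hpp>

namespace allocator {

struct Options
{
  // How often to perform batch allocations; mirrors the master's
  // --allocation_interval flag (assumed default of 1 second).
  Duration allocationInterval = Seconds(1);
};

} // namespace allocator
{code}

The master would populate this struct from its flags and hand it to the 
allocator at initialization, so the allocator no longer needs to know that 
{{master::Flags}} exists.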



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (MESOS-2287) Document undocumented tests

2015-01-28 Thread Niklas Quarfot Nielsen (JIRA)
Niklas Quarfot Nielsen created MESOS-2287:
-

 Summary: Document undocumented tests
 Key: MESOS-2287
 URL: https://issues.apache.org/jira/browse/MESOS-2287
 Project: Mesos
  Issue Type: Improvement
Reporter: Niklas Quarfot Nielsen
Priority: Trivial


We have an inconsistency in the way we document tests. It has become a rule of 
thumb to include a small blurb about the test. For example:

{code}
// This tests the 'active' field in slave entries from state.json. We
// first verify an active slave, deactivate it and verify that the
// 'active' field is false.
TEST_F(MasterTest, SlaveActiveEndpoint)
{
  // Start a master.
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);
  ...
{code}

However, we still have many tests that haven't been documented. For example: 

{code}
}


TEST_F(MasterTest, MetricsInStatsEndpoint)
{
  Try<PID<Master>> master = StartMaster();
  ASSERT_SOME(master);

  Future<process::http::Response> response =
    process::http::get(master.get(), "stats.json");
  ...
{code}

It would be great to do a scan and make sure all the tests are documented.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (MESOS-2286) Simplify the allocator architecture

2015-01-28 Thread Alexander Rukletsov (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2286?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alexander Rukletsov updated MESOS-2286:
---
Component/s: allocation
Description: Allocator refactor 
[https://issues.apache.org/jira/browse/MESOS-2213] will distinguish between 
general allocators and Process-based ones. This introduces a chain of 
inheritance with a single real allocator at the bottom. Consider simplifying 
this architecture without making it harder to add new allocators.
   Priority: Minor  (was: Major)

 Simplify the allocator architecture
 ---

 Key: MESOS-2286
 URL: https://issues.apache.org/jira/browse/MESOS-2286
 Project: Mesos
  Issue Type: Improvement
  Components: allocation
Reporter: Alexander Rukletsov
Priority: Minor

 Allocator refactor [https://issues.apache.org/jira/browse/MESOS-2213] will 
 distinguish between general allocators and Process-based ones. This 
 introduces a chain of inheritance with a single real allocator at the bottom. 
 Consider simplifying this architecture without making it harder to add new 
 allocators.
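
For reference, the hierarchy in question looks roughly like this (a sketch of 
the MESOS-2213 layering; class names are assumptions based on the 
description):

{code}
// Sketch of the layered allocator hierarchy: a general interface, a
// Process-based adapter, and a single real allocator at the bottom.
class Allocator  // general allocator interface
{
public:
  virtual ~Allocator() {}
  virtual void initialize(/* options, callbacks, ... */) = 0;
};

template <typename AllocatorProcess>
class MesosAllocator : public Allocator  // adapts a libprocess Process
{
  // Delegates every interface call to the wrapped AllocatorProcess.
};

class HierarchicalDRFAllocatorProcess;  // the one concrete implementation

typedef MesosAllocator<HierarchicalDRFAllocatorProcess>
  HierarchicalDRFAllocator;
{code}

Flattening some of these layers, or replacing the inheritance with 
composition, would shorten the chain without changing how new allocators are 
plugged in.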



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)