date:20140918


[ 
https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139077#comment-14139077
 ] 

Timothy St. Clair commented on MESOS-1806:
--

[~tnachen] got a branch?  
I'm game to assist, and I'm sure the folks on your end are looking to resolve 
the delta between kube. 


 Substituting etcd or ReplicatedLog for Zookeeper
 

 Key: MESOS-1806
 URL: https://issues.apache.org/jira/browse/MESOS-1806
 Project: Mesos
  Issue Type: Task
Reporter: Ed Ropple
Priority: Minor

 adam_mesos   eropple: Could you also file a new JIRA for Mesos to drop ZK 
 in favor of etcd or ReplicatedLog? Would love to get some momentum going on 
 that one.
 --
 Consider it filed. =)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (MESOS-1392) Failure when znode is removed before we can read its contents.

2014-09-18 Thread Jay Buffington (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139098#comment-14139098
 ] 

Jay Buffington commented on MESOS-1392:
---

Looks like this is resolved by this commit: 
https://github.com/apache/mesos/commit/14c605e8ce425ec8c517d8e4f899eb3ddeede56a

 Failure when znode is removed before we can read its contents.
 --

 Key: MESOS-1392
 URL: https://issues.apache.org/jira/browse/MESOS-1392
 Project: Mesos
  Issue Type: Bug
Affects Versions: 0.19.0
Reporter: Benjamin Mahler
Assignee: Yan Xu

 Looks like the following can occur when a znode goes away right before we can 
 read its contents:
 {noformat: title=Slave exit}
 I0520 16:33:45.721727 29155 group.cpp:382] Trying to create path 
 '/home/mesos/test/master' in ZooKeeper
 I0520 16:33:48.600837 29155 detector.cpp:134] Detected a new leader: 
 (id='2617')
 I0520 16:33:48.601428 29147 group.cpp:655] Trying to get 
 '/home/mesos/test/master/info_002617' in ZooKeeper
 Failed to detect a master: Failed to get data for ephemeral node 
 '/home/mesos/test/master/info_002617' in ZooKeeper: no node
 Slave Exit Status: 1
 {noformat}
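
A minimal sketch of one way to tolerate this race, not the fix in the linked commit: treat "no node" as a membership change and list the group again instead of exiting. `NoNodeError`, `list_children` and `get_data` are hypothetical stand-ins for a ZooKeeper client, not Mesos or ZooKeeper APIs.

```python
class NoNodeError(Exception):
    """Hypothetical stand-in for a ZooKeeper 'no node' error."""


def read_leader_info(list_children, get_data, max_attempts=3):
    """Return the data of the lowest-named child, re-listing if it vanishes.

    list_children() returns znode names; get_data(name) returns bytes or
    raises NoNodeError. Both callables are assumptions for illustration.
    """
    for _ in range(max_attempts):
        children = sorted(list_children())
        if not children:
            return None  # no leader is currently known
        try:
            return get_data(children[0])
        except NoNodeError:
            # The znode went away between listing and reading it:
            # look at the group again rather than failing.
            continue
    return None
```

The retry bound keeps a group that churns constantly from looping forever; a real detector would more likely wait for the next watch event.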




[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

[
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139097#comment-14139097
]

Timothy St. Clair commented on MESOS-1384:
--

Having a pluggable architecture would enable folks to do the following:

1. Test PoC ideas in a clean way without impacting mainline.
2. Enable Service providers to write custom interfaces that may only apply to
their workflow. *This is the big one*
3. Prevent Mesos from accreting too much into its core without well
thought-out boundaries on interfaces and adaptability over time. Forcing
this step helps to define clear boundaries.
...

Add support for loadable MesosModule

Key: MESOS-1384
URL: https://issues.apache.org/jira/browse/MESOS-1384
Project: Mesos
Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

I think we should break this into multiple phases.
-(1) Let's get the dynamic library loading via a stout-ified version of
https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
-
*DONE*
(2) Use (1) to instantiate some classes in Mesos (like an Authenticator
and/or isolator) from a dynamic library. This will give us some more
experience with how we want to name the underlying library symbol, how we
want to specify flags for finding the library, what types of validation we
want when loading a library.
*TARGET*
(3) After doing (2) for one or two classes in Mesos I think we can formalize
the approach in a mesos-ified version of
https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
*NEXT*


[jira] [Created] (MESOS-1814) Task attempted to use more offers than requested in example framework

Vinod Kone created MESOS-1814:
-

 Summary: Task attempted to use more offers than requested in 
example framework
 Key: MESOS-1814
 URL: https://issues.apache.org/jira/browse/MESOS-1814
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone


{code}
[ RUN  ] ExamplesTest.JavaFramework
Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
Enabling authentication for the framework
WARNING: Logging before InitGoogleLogging() is written to STDERR
I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
127.0.1.1:34609 for 8 cpus
I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
11488ns
I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the db 
in 14016ns
I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
positions 0 - 0 with 1 holes and 0 unlearned
I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received a 
broadcasted recover request
I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from a 
replica in EMPTY status
I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to STARTING
I0917 23:14:35.242846 31524 master.cpp:286] Master 
20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing authenticated 
frameworks to register
I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
slaves to register
I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials file 
'/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
recommended that your credentials file is NOT accessible by others.
I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
Initializing hierarchical allocator process with master : master@127.0.1.1:34609
I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
offers for all slaves
I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
posix/cpu,posix/mem
I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
posix/cpu,posix/mem
I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
mem(*):1001; disk(*):24988; ports(*):[31000-32000]
I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
'/tmp/mesos-w8snRW/0/meta'
I0917 23:14:35.255106 31519 leveldb.cpp:306] Persisting metadata (8 bytes) to 
leveldb took 13.99622ms
I0917 23:14:35.255235 31519 replica.cpp:320] Persisted replica status to 
STARTING
I0917 23:14:35.255419 31519 recover.cpp:451] Replica is in STARTING status
I0917 23:14:35.255834 31519 replica.cpp:638] Replica in STARTING status 
received a broadcasted recover request
I0917 23:14:35.256000 31519 recover.cpp:188] Received a recover response from a 
replica in STARTING status
I0917 23:14:35.256217 31519 recover.cpp:542] Updating replica status to VOTING
I0917 23:14:35.256641 31520 status_update_manager.cpp:193] Recovering status 
update manager
I0917 23:14:35.257064 31520 containerizer.cpp:252] Recovering containerizer
I0917 23:14:35.257725 31520 slave.cpp:3220] Finished recovery
I0917 23:14:35.258463 31520 slave.cpp:600] New master detected at 
master@127.0.1.1:34609
I0917 23:14:35.258769 31524 status_update_manager.cpp:167] New master detected 
at master@127.0.1.1:34609
I0917 23:14:35.258885 31520 slave.cpp:636] No credentials provided. Attempting 
to register without authentication
I0917 23:14:35.259024 31520 slave.cpp:647] Detecting new master
I0917 23:14:35.259863 31520 slave.cpp:169] Slave started on 2)@127.0.1.1:34609
I0917 23:14:35.260288 31520 slave.cpp:289] Slave resources: cpus(*):1; 
mem(*):1001; disk(*):24988;

[jira] [Commented] (MESOS-1812) Queued tasks are not actually launched in the order they were queued

2014-09-18 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1812?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139171#comment-14139171
 ] 

Dominic Hamon commented on MESOS-1812:
--

MESOS-497 doesn't give any reasoning other than "it would be nice", so I would 
also like to hear why this is important.

I'm not saying it isn't, just want to make sure we're not artificially adding 
constraints to the system.

 Queued tasks are not actually launched in the order they were queued
 

 Key: MESOS-1812
 URL: https://issues.apache.org/jira/browse/MESOS-1812
 Project: Mesos
  Issue Type: Bug
  Components: slave
Reporter: Tom Arnfeld

 Even though tasks are assigned and queued in the order in which they are 
 launched (e.g. multiple tasks in reply to one offer), timing issues with the 
 futures can sometimes break that causality, so the tasks end up not being 
 launched in order.
 Example trace from a slave... In this example the Task_Tracker_10 task should 
 be launched before slots_Task_Tracker_10.
 {code}
 I0918 02:10:50.371445 17072 slave.cpp:933] Got assigned task Task_Tracker_10 
 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.372110 17072 slave.cpp:933] Got assigned task 
 slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.372172 17073 gc.cpp:84] Unscheduling 
 '/mnt/mesos-slave/slaves/20140915-112519-3171422218-5050-5016-6/frameworks/20140916-233111-3171422218-5050-14295-0015'
  from gc
 I0918 02:10:50.375018 17072 slave.cpp:1043] Launching task 
 slots_Task_Tracker_10 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.386282 17072 slave.cpp:1153] Queuing task 
 'slots_Task_Tracker_10' for executor executor_Task_Tracker_10 of framework 
 '20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.386312 17070 mesos_containerizer.cpp:537] Starting container 
 '5f507f09-b48e-44ea-b74e-740b0e8bba4d' for executor 
 'executor_Task_Tracker_10' of framework 
 '20140916-233111-3171422218-5050-14295-0015'
 I0918 02:10:50.388942 17072 slave.cpp:1043] Launching task Task_Tracker_10 
 for framework 20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.406277 17070 launcher.cpp:117] Forked child with pid '817' for 
 container '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
 I0918 02:10:50.406563 17072 slave.cpp:1153] Queuing task 'Task_Tracker_10' 
 for executor executor_Task_Tracker_10 of framework 
 '20140916-233111-3171422218-5050-14295-0015
 I0918 02:10:50.408499 17069 mesos_containerizer.cpp:647] Fetching URIs for 
 container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' using command 
 '/usr/local/libexec/mesos/mesos-fetcher'
 I0918 02:11:11.650687 17071 slave.cpp:2873] Current usage 17.34%. Max allowed 
 age: 5.086371210668750days
 I0918 02:11:16.590270 17075 slave.cpp:2355] Monitoring executor 
 'executor_Task_Tracker_10' of framework 
 '20140916-233111-3171422218-5050-14295-0015' in container 
 '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
 I0918 02:11:17.701015 17070 slave.cpp:1664] Got registration for executor 
 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015
 I0918 02:11:17.701897 17070 slave.cpp:1783] Flushing queued task 
 slots_Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015
 I0918 02:11:17.702350 17070 slave.cpp:1783] Flushing queued task 
 Task_Tracker_10 for executor 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015
 I0918 02:11:18.588388 17070 mesos_containerizer.cpp:1112] Executor for 
 container '5f507f09-b48e-44ea-b74e-740b0e8bba4d' has exited
 I0918 02:11:18.588665 17070 mesos_containerizer.cpp:996] Destroying container 
 '5f507f09-b48e-44ea-b74e-740b0e8bba4d'
 I0918 02:11:18.599234 17072 slave.cpp:2413] Executor 
 'executor_Task_Tracker_10' of framework 
 20140916-233111-3171422218-5050-14295-0015 has exited with status 1
 {code}
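
A minimal sketch of one way to preserve assignment order when per-task asynchronous steps finish out of order: buffer completions and launch only from the head of the queue. The `OrderedLauncher` class is hypothetical and is not how the slave is implemented.

```python
class OrderedLauncher:
    """Launch tasks in assignment order even if they become ready out of order."""

    def __init__(self):
        self._order = []     # task ids in the order they were assigned
        self._ready = set()  # tasks whose asynchronous preparation finished
        self.launched = []   # tasks launched so far, in launch order

    def assign(self, task_id):
        self._order.append(task_id)

    def prepared(self, task_id):
        self._ready.add(task_id)
        # Launch from the head of the queue only while the head is ready,
        # so a later task can never overtake an earlier one.
        while self._order and self._order[0] in self._ready:
            head = self._order.pop(0)
            self._ready.discard(head)
            self.launched.append(head)
```

With the trace above, `slots_Task_Tracker_10` becoming ready first would simply wait until `Task_Tracker_10` is ready too.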




[jira] [Commented] (MESOS-1662) Mesos doesn't limit swap

2014-09-18 Thread Chi Hoang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139176#comment-14139176
 ] 

Chi Hoang commented on MESOS-1662:
--

Wondering what happened with this fix.  Status says fixed, but it wasn't 
included in 0.20.0.

 Mesos doesn't limit swap
 

 Key: MESOS-1662
 URL: https://issues.apache.org/jira/browse/MESOS-1662
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.19.1
Reporter: Andrew Forgue
Assignee: Anton Lindström

 When using control groups, mesos will limit memory usage, but if the 
 CONFIG_MEMCG_SWAP config option is enabled swap usage is not limited.
 This means that if a task that asked for 1G allocates 4G, it will fill 3G 
 of swap.  The expected behavior is that the cgroup should have OOMed.  The 
 control group key for limiting both Memory+Swap is 
 memory.memsw.limit_in_bytes (not memory.limit_in_bytes).  It looks like 
 CONFIG_MEMCG_SWAP showed up in Kernel 3.6.
 Mesos should limit swap+memory if possible.  I can't imagine when you'd want 
 to limit memory but not swap, but there may be some situations.
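
A minimal sketch of the requested behavior, assuming the cgroup v1 memory controller: compute which control files to write for a given limit. The function name and the `swap_accounting_enabled` parameter are hypothetical; only the two control file names come from the description above.

```python
def memory_limit_writes(limit_bytes, swap_accounting_enabled):
    """Return (control file, value) pairs to write, in order."""
    # memory.limit_in_bytes caps memory only and is always written.
    writes = [("memory.limit_in_bytes", str(limit_bytes))]
    if swap_accounting_enabled:
        # memory.memsw.limit_in_bytes caps memory plus swap together.
        # It is written second because it may not be lower than
        # memory.limit_in_bytes.
        writes.append(("memory.memsw.limit_in_bytes", str(limit_bytes)))
    return writes
```

Setting both files to the same value means a task that asked for 1G cannot spill into swap beyond that 1G.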




[jira] [Updated] (MESOS-1662) Mesos doesn't limit swap

2014-09-18 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-1662:
--
Fix Version/s: 0.20.0

 Mesos doesn't limit swap
 

 Key: MESOS-1662
 URL: https://issues.apache.org/jira/browse/MESOS-1662
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.19.1
Reporter: Andrew Forgue
Assignee: Anton Lindström
 Fix For: 0.20.0


 When using control groups, mesos will limit memory usage, but if the 
 CONFIG_MEMCG_SWAP config option is enabled swap usage is not limited.
 This means that if a task that asked for 1G allocates 4G, it will fill 3G 
 of swap.  The expected behavior is that the cgroup should have OOMed.  The 
 control group key for limiting both Memory+Swap is 
 memory.memsw.limit_in_bytes (not memory.limit_in_bytes).  It looks like 
 CONFIG_MEMCG_SWAP showed up in Kernel 3.6.
 Mesos should limit swap+memory if possible.  I can't imagine when you'd want 
 to limit memory but not swap, but there may be some situations.




[jira] [Updated] (MESOS-1814) Task attempted to use more offers than requested in example Java and Python frameworks


 [ 
https://issues.apache.org/jira/browse/MESOS-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1814:
--
  Component/s: test
   Sprint: Mesos Q3 Sprint 5
 Target Version/s: 0.21.0
Affects Version/s: 0.21.0
 Shepherd: Yan Xu
 Story Points: 2
  Summary: Task attempted to use more offers than requested in 
example Java and Python frameworks  (was: Task attempted to use more offers 
than requested in example framework)

This is a latent bug in both the Java and Python example frameworks. Both 
frameworks launch tasks without checking whether the resources offered to them 
are enough to launch the task.

We are seeing this now because of the recently landed change that offers 
frameworks resources with no memory or no cpu. Before this change, no such 
offers were made, so the frameworks were lucky that every offer they received 
matched their task requirements.

I'll send a patch shortly.
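
A minimal sketch of the missing check, not the forthcoming patch: before launching, sum the scalar `cpus` and `mem` in an offer and compare them with what the task needs. Plain dicts stand in for the Mesos resource messages, and `offer_satisfies` is a hypothetical helper.

```python
def offer_satisfies(offer_resources, task_cpus, task_mem):
    """Return True if the offer holds at least the cpus and mem the task needs."""
    # An offer may list a resource more than once or omit it entirely,
    # so sum per name and let a missing resource count as 0.
    cpus = sum(r["value"] for r in offer_resources if r["name"] == "cpus")
    mem = sum(r["value"] for r in offer_resources if r["name"] == "mem")
    return cpus >= task_cpus and mem >= task_mem
```

A framework would decline, or simply not use, any offer for which this returns False.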

 Task attempted to use more offers than requested in example Java and Python 
 frameworks
 --

 Key: MESOS-1814
 URL: https://issues.apache.org/jira/browse/MESOS-1814
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Vinod Kone
Assignee: Vinod Kone


[jira] [Created] (MESOS-1815) Create a guide to becoming a committer

2014-09-18 Thread Dominic Hamon (JIRA)

Dominic Hamon created MESOS-1815:


 Summary: Create a guide to becoming a committer
 Key: MESOS-1815
 URL: https://issues.apache.org/jira/browse/MESOS-1815
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Dominic Hamon
Assignee: Dominic Hamon


We have a committer's guide, but the process by which one becomes a committer 
is unclear. We should set some guidelines and a process by which we can grow 
contributors into committers.




[jira] [Commented] (MESOS-1815) Create a guide to becoming a committer

2014-09-18 Thread Dominic Hamon (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1815?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139232#comment-14139232
 ] 

Dominic Hamon commented on MESOS-1815:
--

Please review at https://reviews.apache.org/r/25785/


 Create a guide to becoming a committer
 --

 Key: MESOS-1815
 URL: https://issues.apache.org/jira/browse/MESOS-1815
 Project: Mesos
  Issue Type: Documentation
  Components: documentation
Reporter: Dominic Hamon
Assignee: Dominic Hamon

 We have a committer's guide, but the process by which one becomes a committer 
 is unclear. We should set some guidelines and a process by which we can grow 
 contributors into committers.




[jira] [Commented] (MESOS-1662) Mesos doesn't limit swap

2014-09-18 Thread Chi Hoang (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1662?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139289#comment-14139289
 ] 

Chi Hoang commented on MESOS-1662:
--

awesome!  thanks!

 Mesos doesn't limit swap
 

 Key: MESOS-1662
 URL: https://issues.apache.org/jira/browse/MESOS-1662
 Project: Mesos
  Issue Type: Bug
  Components: isolation
Affects Versions: 0.19.1
Reporter: Andrew Forgue
Assignee: Anton Lindström
 Fix For: 0.20.0


 When using control groups, mesos will limit memory usage, but if the 
 CONFIG_MEMCG_SWAP config option is enabled swap usage is not limited.
 This means that if a task that asked for 1G allocates 4G, it will fill 3G 
 of swap.  The expected behavior is that the cgroup should have OOMed.  The 
 control group key for limiting both Memory+Swap is 
 memory.memsw.limit_in_bytes (not memory.limit_in_bytes).  It looks like 
 CONFIG_MEMCG_SWAP showed up in Kernel 3.6.
 Mesos should limit swap+memory if possible.  I can't imagine when you'd want 
 to limit memory but not swap, but there may be some situations.




[jira] [Updated] (MESOS-1808) expose RTT in container stats

2014-09-18 Thread Jie Yu (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1808?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jie Yu updated MESOS-1808:
--
Assignee: Chi Zhang  (was: Jie Yu)

 expose RTT in container stats
 -

 Key: MESOS-1808
 URL: https://issues.apache.org/jira/browse/MESOS-1808
 Project: Mesos
  Issue Type: Task
  Components: containerization
Reporter: Dominic Hamon
Assignee: Chi Zhang

 Just as we expose the bandwidth, we should expose the RTT as a measure of the 
 latency each container is experiencing.
 We can use {{ss}} to get the per-socket statistics and filter and aggregate 
 accordingly to get a measure of RTT.
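
A minimal sketch of the aggregation step, assuming the per-socket text has already been captured from `ss -i` and filtered to one container's sockets. It relies on the `rtt:<rtt>/<rttvar>` field that `ss` prints in milliseconds; `mean_rtt_ms` is a hypothetical helper.

```python
import re

# Matches "rtt:4.5/2.25" but not "minrtt:3.9" (no word boundary before "rtt").
RTT_PATTERN = re.compile(r"\brtt:(\d+(?:\.\d+)?)/(\d+(?:\.\d+)?)")


def mean_rtt_ms(ss_output):
    """Return the mean smoothed RTT in ms over all sockets, or None if absent."""
    rtts = [float(m.group(1)) for m in RTT_PATTERN.finditer(ss_output)]
    return sum(rtts) / len(rtts) if rtts else None
```

A mean is only one possible aggregate; a maximum or a percentile may be a better measure of the latency a container experiences.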




[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

[
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139349#comment-14139349
]

Timothy St. Clair commented on MESOS-1384:
--

Folks -

I think this is ready for review.
You might want to make a couple of minor changes around named loading, e.g.
libFoo.so vs. libFoo.dylib.
The load could check for an extension and, in its absence, do the right
thing, e.g. load(Foo).
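
A minimal sketch of the suggested name handling, assuming only the two extensions mentioned above. `library_filename` is a hypothetical helper, not part of the patch under review.

```python
import os
import sys


def library_filename(name, platform=sys.platform):
    """Expand a bare library name to a platform-specific file name."""
    _, ext = os.path.splitext(name)
    if ext in (".so", ".dylib"):
        return name  # an explicit extension is left untouched
    # Bare name: add the conventional prefix and the platform's suffix.
    suffix = ".dylib" if platform == "darwin" else ".so"
    return "lib" + name + suffix
```

With this, load(Foo) would resolve to libFoo.dylib on OS X and to libFoo.so elsewhere.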

Add support for loadable MesosModule

Key: MESOS-1384
URL: https://issues.apache.org/jira/browse/MESOS-1384
Project: Mesos
Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule


[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139390#comment-14139390
 ] 

Bernd Mathiske commented on MESOS-1384:
---

[~tstclair] Thanks for the vote of confidence! We will make a code improvement 
pass now and also remove non-essentials to get to a minimal viable first patch.

 Add support for loadable MesosModule
 

 Key: MESOS-1384
 URL: https://issues.apache.org/jira/browse/MESOS-1384
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

 I think we should break this into multiple phases.
 -(1) Let's get the dynamic library loading via a stout-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
  -
 *DONE*
 (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
 and/or isolator) from a dynamic library. This will give us some more 
 experience with how we want to name the underlying library symbol, how we 
 want to specify flags for finding the library, what types of validation we 
 want when loading a library.
 *TARGET* 
 (3) After doing (2) for one or two classes in Mesos I think we can formalize 
 the approach in a mesos-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
 *NEXT*




[jira] [Comment Edited] (MESOS-1384) Add support for loadable MesosModule

[
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139390#comment-14139390
]

Bernd Mathiske edited comment on MESOS-1384 at 9/18/14 7:57 PM:

[~tstclair] Thanks for the vote of confidence! We can do a code improvement pass
now and also remove non-essentials to get to a minimal viable first patch.

However, we still have to settle the question of what the command line interface
should look like. Go for JSON right away? On the command line? Or maybe this:
keep the simple format (lib path:module name,...) we have right now and
also add a second flag that points at a JSON file?
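
A minimal sketch of parsing the simple format, assuming it means comma-separated `path:module` entries. The exact grammar is not spelled out in this thread, and `parse_modules_flag` is a hypothetical helper.

```python
def parse_modules_flag(value):
    """Parse 'path:module,path:module' into {path: [module, ...]}."""
    modules = {}
    for entry in filter(None, (e.strip() for e in value.split(","))):
        # Split on the last ':' so that only the module name follows it.
        path, sep, name = entry.rpartition(":")
        if not sep or not path or not name:
            raise ValueError("expected 'path:module', got %r" % entry)
        modules.setdefault(path, []).append(name)
    return modules
```

Anything beyond a path and a module name per entry would not fit this format.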

was (Author: bernd-mesos):
[~tstclair] Thanks for the vote of confidence! We will can a code improvement
pass now and also remove non-essentials to get to a minimal viable first patch.

Add support for loadable MesosModule

Key: MESOS-1384
URL: https://issues.apache.org/jira/browse/MESOS-1384
Project: Mesos
Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen


[jira] [Comment Edited] (MESOS-1384) Add support for loadable MesosModule

[
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139390#comment-14139390
]

Bernd Mathiske edited comment on MESOS-1384 at 9/18/14 7:57 PM:

[~tstclair] Thanks for the vote of confidence! We will can a code improvement
pass now and also remove non-essentials to get to a minimal viable first patch.

was (Author: bernd-mesos):
[~tstclair] Thanks for the vote of confidence! We will make a code improvement
pass now and also remove non-essentials to get to a minimal viable first patch.

Add support for loadable MesosModule

Key: MESOS-1384
URL: https://issues.apache.org/jira/browse/MESOS-1384
Project: Mesos
Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen


[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule


[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139412#comment-14139412
 ] 

Timothy St. Clair commented on MESOS-1384:
--

Keep it simple for now, as I fully expect this to iterate over time.  

It's also auxiliary and nothing depends on it yet, so until that point happens 
there can be refinement. 

 Add support for loadable MesosModule
 

 Key: MESOS-1384
 URL: https://issues.apache.org/jira/browse/MESOS-1384
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

 I think we should break this into multiple phases.
 -(1) Let's get the dynamic library loading via a stout-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
  -
 *DONE*
 (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
 and/or isolator) from a dynamic library. This will give us some more 
 experience with how we want to name the underlying library symbol, how we 
 want to specify flags for finding the library, what types of validation we 
 want when loading a library.
 *TARGET* 
 (3) After doing (2) for one or two classes in Mesos I think we can formalize 
 the approach in a mesos-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
 *NEXT*




[jira] [Resolved] (MESOS-1809) Modify docker pull to use docker inspect after a successful pull

2014-09-18 Thread Timothy Chen (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Chen resolved MESOS-1809.
-
Resolution: Fixed

 Modify docker pull to use docker inspect after a successful pull
 

 Key: MESOS-1809
 URL: https://issues.apache.org/jira/browse/MESOS-1809
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen

 Currently, in docker pull, we read the stdout of the pull to construct the 
 Docker image object; however, that stdout contains extra output.
 We should run docker inspect after the pull instead.
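
A minimal sketch of the proposed flow, not the Mesos C++ change itself: ignore what `docker pull` prints and build the image description from `docker inspect`, which prints a JSON array. The `run` parameter is an assumption added so the command runner can be swapped out.

```python
import json
import subprocess


def pull_and_inspect(image, run=subprocess.check_output):
    """Pull an image, then return its first `docker inspect` entry as a dict."""
    # The pull output is progress text meant for humans; discard it.
    run(["docker", "pull", image])
    # docker inspect prints a JSON array with one object per inspected name.
    raw = run(["docker", "inspect", image])
    return json.loads(raw)[0]
```

Because only the inspect output is parsed, extra lines printed during the pull no longer matter.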




[jira] [Updated] (MESOS-1809) Modify docker pull to use docker inspect after a successful pull

2014-09-18 Thread Adam B (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Adam B updated MESOS-1809:
--
Fix Version/s: 0.20.1

 Modify docker pull to use docker inspect after a successful pull
 

 Key: MESOS-1809
 URL: https://issues.apache.org/jira/browse/MESOS-1809
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen
 Fix For: 0.20.1


 Currently, in docker pull, we read the stdout of the pull to construct the 
 Docker image object; however, that stdout contains extra output.
 We should run docker inspect after the pull instead.




[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version


[ 
https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139531#comment-14139531
 ] 

Timothy St. Clair commented on MESOS-1675:
--

Provided that they linked to libmesos.so, I don't believe so.

 Decouple version of the mesos library from the package release version
 --

 Key: MESOS-1675
 URL: https://issues.apache.org/jira/browse/MESOS-1675
 Project: Mesos
  Issue Type: Bug
Reporter: Vinod Kone

 This discussion should be rolled into the larger discussion around how to 
 version Mesos (APIs, packages, libraries etc).
 Some notes from libtool docs.
 http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html
 http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers




[jira] [Commented] (MESOS-1809) Modify docker pull to use docker inspect after a successful pull

2014-09-18 Thread Timothy Chen (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139532#comment-14139532
 ] 

Timothy Chen commented on MESOS-1809:
-

commit 48db9a513fac0066c8f38aa98b8d893fdf298998
Author: Timothy Chen tnac...@apache.org
Date:   Thu Sep 18 02:11:40 2014 -0700

Modify Docker::pull to call inspect after pull.

Review: https://reviews.apache.org/r/25758


 Modify docker pull to use docker inspect after a successful pull
 

 Key: MESOS-1809
 URL: https://issues.apache.org/jira/browse/MESOS-1809
 Project: Mesos
  Issue Type: Bug
Reporter: Timothy Chen
Assignee: Timothy Chen
 Fix For: 0.20.1


 Currently in docker pull we read the stdout of pull to construct the docker 
 image object; however, that stdout contains extra output.
 We should run docker inspect after the pull instead.




[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule


[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139553#comment-14139553
 ] 

Vinod Kone commented on MESOS-1384:
---

Please have the flag as JSON. It's easy to maintain. Our JSON flag parser 
accepts either a file containing JSON or a raw JSON string.
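
A flag that accepts either a JSON file or a raw JSON string could behave roughly like the sketch below. This is Python for illustration only and is not the Mesos flag parser: the function name parse_json_flag, the rule "treat the value as a path if such a file exists", and the "libraries" key are all assumptions.

```python
import json
import os
import tempfile


def parse_json_flag(value):
    """Parse a flag value given as a path to a JSON file or as raw JSON."""
    # Assumption: an existing file wins; anything else is parsed as raw JSON.
    if os.path.isfile(value):
        with open(value) as f:
            return json.load(f)
    return json.loads(value)


# Raw JSON passed directly on the command line.
raw = parse_json_flag('{"libraries": []}')

# The same content passed as a file path.
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    f.write('{"libraries": []}')
from_file = parse_json_flag(f.name)
os.remove(f.name)

print(raw == from_file)
```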

 Add support for loadable MesosModule
 

 Key: MESOS-1384
 URL: https://issues.apache.org/jira/browse/MESOS-1384
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

 I think we should break this into multiple phases.
 -(1) Let's get the dynamic library loading via a stout-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
  -
 *DONE*
 (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
 and/or isolator) from a dynamic library. This will give us some more 
 experience with how we want to name the underlying library symbol, how we 
 want to specify flags for finding the library, what types of validation we 
 want when loading a library.
 *TARGET* 
 (3) After doing (2) for one or two classes in Mesos I think we can formalize 
 the approach in a mesos-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
 *NEXT*




[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule

2014-09-18 Thread Niklas Quarfot Nielsen (JIRA)


[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139575#comment-14139575
 ] 

Niklas Quarfot Nielsen commented on MESOS-1384:
---

[~vinodkone] +1

 Add support for loadable MesosModule
 

 Key: MESOS-1384
 URL: https://issues.apache.org/jira/browse/MESOS-1384
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

 I think we should break this into multiple phases.
 -(1) Let's get the dynamic library loading via a stout-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
  -
 *DONE*
 (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
 and/or isolator) from a dynamic library. This will give us some more 
 experience with how we want to name the underlying library symbol, how we 
 want to specify flags for finding the library, what types of validation we 
 want when loading a library.
 *TARGET* 
 (3) After doing (2) for one or two classes in Mesos I think we can formalize 
 the approach in a mesos-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
 *NEXT*




[jira] [Updated] (MESOS-1816) lxc execution driver support for docker containerizer

2014-09-18 Thread Eugen Feller (JIRA)

[
https://issues.apache.org/jira/browse/MESOS-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Eugen Feller updated MESOS-1816:

Summary: lxc execution driver support for docker containerizer (was: lxc
execution driver for docker containerizer)

lxc execution driver support for docker containerizer
-

Key: MESOS-1816
URL: https://issues.apache.org/jira/browse/MESOS-1816
Project: Mesos
Issue Type: Improvement
Components: containerization
Affects Versions: 0.20.1
Reporter: Eugen Feller
Labels: docker

Hi all,
One way to get networking up and running in Docker is to use the bridge mode.
The bridge mode results in Docker automatically assigning IPs to the
containers from the IP range specified on the docker0 bridge.
In our setup we need to manage IPs using our own DHCP server. Unfortunately
this is not supported by Docker's libcontainer execution driver. Instead, the
lxc execution driver
(http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
can be used. In order to use the lxc execution driver, the Docker daemon needs
to be started with the -e lxc flag. Once started, Docker's own networking can
be disabled and lxc options can be passed to the docker run command. For
example:
$ docker run -n=false --lxc-conf="lxc.network.type = veth" \
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" \
--lxc-conf="lxc.network.flags = up"
This will force Docker to use my own bridge br0. Moreover, an IP can be assigned
to the eth0 interface by executing the dhclient eth0 command inside the
started container.
In the previous integration of Docker in Mesos (using Deimos), I have passed
the aforementioned options using the options flag in Marathon. However,
with the new changes this is no longer possible. It would be great to support
the lxc execution driver in the current Docker integration.
Thanks.
Best regards,
Eugen


[jira] [Created] (MESOS-1816) lxc execution driver for docker containerizer

2014-09-18 Thread Eugen Feller (JIRA)

Eugen Feller created MESOS-1816:
---

Summary: lxc execution driver for docker containerizer
Key: MESOS-1816
URL: https://issues.apache.org/jira/browse/MESOS-1816
Project: Mesos
Issue Type: Improvement
Components: containerization
Affects Versions: 0.20.1
Reporter: Eugen Feller

Hi all,

One way to get networking up and running in Docker is to use the bridge mode.
The bridge mode results in Docker automatically assigning IPs to the containers
from the IP range specified on the docker0 bridge.

In our setup we need to manage IPs using our own DHCP server. Unfortunately
this is not supported by Docker's libcontainer execution driver. Instead, the
lxc execution driver
(http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
can be used. In order to use the lxc execution driver, the Docker daemon needs to
be started with the -e lxc flag. Once started, Docker's own networking can be
disabled and lxc options can be passed to the docker run command. For example:

$ docker run -n=false --lxc-conf="lxc.network.type = veth" \
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" \
--lxc-conf="lxc.network.flags = up"

This will force Docker to use my own bridge br0. Moreover, an IP can be assigned
to the eth0 interface by executing the dhclient eth0 command inside the
started container.
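
For a launcher that has to assemble such a command programmatically, the options could be built as an argument list, as in this Python sketch. The helper name docker_run_args and the image name example/image are hypothetical; the flags are the ones from the command above, and each lxc option is passed as a single argument so the embedded spaces need no shell quoting.

```python
import shlex


def docker_run_args(image, lxc_conf):
    """Build an argv list for `docker run` with Docker networking disabled."""
    args = ["docker", "run", "-n=false"]
    for key, value in lxc_conf.items():
        # One argv entry per option, e.g. "--lxc-conf=lxc.network.link = br0".
        args.append(f"--lxc-conf={key} = {value}")
    args.append(image)
    return args


lxc_conf = {
    "lxc.network.type": "veth",
    "lxc.network.link": "br0",
    "lxc.network.name": "eth0",
    "lxc.network.flags": "up",
}
args = docker_run_args("example/image", lxc_conf)  # hypothetical image name
# shlex.join shows how the same command would be quoted in a shell.
print(shlex.join(args))
```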

In the previous integration of Docker in Mesos (using Deimos), I have passed
the aforementioned options using the options flag in Marathon. However, with
the new changes this is no longer possible. It would be great to support the
lxc execution driver in the current Docker integration.

Thanks.

Best regards,
Eugen


[jira] [Updated] (MESOS-1816) lxc execution driver support for docker containerizer

2014-09-18 Thread Eugen Feller (JIRA)

[
https://issues.apache.org/jira/browse/MESOS-1816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Eugen Feller updated MESOS-1816:

Description:
Hi all,

$ docker run -n=false --lxc-conf="lxc.network.type = veth" \
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" \
--lxc-conf="lxc.network.flags = up" ...

This will force Docker to use my own bridge br0. Moreover, an IP can be assigned
to the eth0 interface by executing the dhclient eth0 command inside the
started container.

Thanks.

Best regards,
Eugen

was:
Hi all,

$ docker run -n=false --lxc-conf="lxc.network.type = veth" \
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" \
--lxc-conf="lxc.network.flags = up"

This will force Docker to use my own bridge br0. Moreover, an IP can be assigned
to the eth0 interface by executing the dhclient eth0 command inside the
started container.

Thanks.

Best regards,
Eugen

lxc execution driver support for docker containerizer
-

Hi all,
One way to get networking up and running in Docker is to use the bridge mode.
The bridge mode results in Docker automatically assigning IPs to the
containers from the IP range specified on the docker0 bridge.
In our setup we need to manage IPs using our own DHCP server. Unfortunately
this is not supported by Docker's libcontainer execution driver. Instead, the
lxc execution driver
(http://blog.docker.com/2014/03/docker-0-9-introducing-execution-drivers-and-libcontainer/)
can be used. In order to use the lxc execution driver, the Docker daemon needs
to be started with the -e lxc flag. Once started, Docker's own networking can
be disabled and lxc options can be passed to the docker run command. For
example:
$ docker run -n=false --lxc-conf="lxc.network.type = veth" \
--lxc-conf="lxc.network.link = br0" --lxc-conf="lxc.network.name = eth0" \
--lxc-conf="lxc.network.flags = up" ...
This will force Docker to use my own bridge br0. Moreover, an IP can be assigned
to the eth0 interface by executing the dhclient eth0 command inside the
started container.
In the previous integration of Docker in Mesos (using Deimos), I have passed
the aforementioned options using the options flag in Marathon. However,
with the new changes this is no longer possible. It would be great to support
the lxc execution driver in the current Docker integration.
Thanks.
Best regards,
Eugen


[jira] [Commented] (MESOS-1814) Task attempted to use more offers than requested in example jave and python frameworks


[ 
https://issues.apache.org/jira/browse/MESOS-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139686#comment-14139686
 ] 

Vinod Kone commented on MESOS-1814:
---

https://reviews.apache.org/r/25801/

 Task attempted to use more offers than requested in example jave and python 
 frameworks
 --

 Key: MESOS-1814
 URL: https://issues.apache.org/jira/browse/MESOS-1814
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Vinod Kone
Assignee: Vinod Kone

 {code}
 [ RUN  ] ExamplesTest.JavaFramework
 Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
 Enabling authentication for the framework
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
 127.0.1.1:34609 for 8 cpus
 I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
 I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
 I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
 I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
 I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
 11488ns
 I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 14016ns
 I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
 I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
 I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to 
 STARTING
 I0917 23:14:35.242846 31524 master.cpp:286] Master 
 20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
 I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing 
 authenticated frameworks to register
 I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
 slaves to register
 I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
 authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
 W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials 
 file '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
 recommended that your credentials file is NOT accessible by others.
 I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
 I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:34609
 I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
 offers for all slaves
 I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
 master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
 I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
 I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
 I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
 I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
 posix/cpu,posix/mem
 I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
 posix/cpu,posix/mem
 I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
 I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
 mem(*):1001; disk(*):24988; ports(*):[31000-32000]
 I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
 I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
 I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
 '/tmp/mesos-w8snRW/0/meta'
 I0917 23:14:35.255106 31519 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 13.99622ms
 I0917 23:14:35.255235 31519 replica.cpp:320] Persisted replica status to 
 STARTING
 I0917 23:14:35.255419 31519 recover.cpp:451] Replica is in STARTING status
 I0917 23:14:35.255834 31519 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0917 23:14:35.256000 31519 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0917 23:14:35.256217 31519 recover.cpp:542] Updating replica status to VOTING
 I0917 23:14:35.256641 31520 status_update_manager.cpp:193] Recovering status 
 update manager
 I0917 23:14:35.257064 31520 containerizer.cpp:252] Recovering containerizer
 I0917 23:14:35.257725 31520 slave.cpp:3220] Finished recovery
 I0917 23:14:35.258463 31520 slave.cpp:600] New master detected at

[jira] [Commented] (MESOS-1813) Fail fast in example frameworks if task goes into unexpected state


[ 
https://issues.apache.org/jira/browse/MESOS-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139777#comment-14139777
 ] 

Vinod Kone commented on MESOS-1813:
---

https://reviews.apache.org/r/25805/

 Fail fast in example frameworks if task goes into unexpected state
 --

 Key: MESOS-1813
 URL: https://issues.apache.org/jira/browse/MESOS-1813
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: Vinod Kone
Assignee: Vinod Kone

 Most of the example frameworks launch a bunch of tasks and exit if *all* of 
 them reach the FINISHED state. But if there is a bug in the code resulting in 
 TASK_LOST, the framework waits forever. Instead, the framework should abort if 
 an unexpected task state is encountered.
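
The fail-fast behaviour described above can be sketched without the Mesos bindings. In this Python illustration the class TaskTracker, the exception UnexpectedTaskState and the set of expected states are assumptions; a real example framework would presumably abort its scheduler driver where this sketch raises.

```python
# States an example framework expects to see for a healthy task (assumed set).
EXPECTED_STATES = {"TASK_STAGING", "TASK_STARTING", "TASK_RUNNING",
                   "TASK_FINISHED"}


class UnexpectedTaskState(Exception):
    """Raised when a task reports a state the framework did not expect."""


class TaskTracker:
    def __init__(self, total_tasks):
        self.total_tasks = total_tasks
        self.finished = 0

    def status_update(self, task_id, state):
        """Record an update; return True once all tasks have finished."""
        if state not in EXPECTED_STATES:
            # Fail fast instead of waiting forever for TASK_FINISHED.
            raise UnexpectedTaskState(f"task {task_id} is in state {state}")
        if state == "TASK_FINISHED":
            self.finished += 1
        return self.finished == self.total_tasks


tracker = TaskTracker(total_tasks=2)
tracker.status_update("0", "TASK_RUNNING")
tracker.status_update("0", "TASK_FINISHED")
try:
    tracker.status_update("1", "TASK_LOST")
except UnexpectedTaskState as error:
    print("aborting:", error)
```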




[jira] [Updated] (MESOS-1817) Completed tasks remains in TASK_RUNNING when framework is disconnected

2014-09-18 Thread Niklas Quarfot Nielsen (JIRA)


 [ 
https://issues.apache.org/jira/browse/MESOS-1817?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Niklas Quarfot Nielsen updated MESOS-1817:
--
Description: 
We have run into a problem that causes tasks which complete while a framework 
is disconnected (with a failover timeout set) to remain in a running state even 
though the tasks have actually finished. This hogs the cluster and gives users an 
inconsistent view of the cluster state: on the slave, the task is 
finished; on the master, the task is still in a non-terminal state. When 
the scheduler reattaches or the failover timeout expires, the tasks finish 
correctly. The current workflow of this scheduler has a long failover timeout, 
but it may, on the other hand, never reattach.

Here is a test framework we have been able to reproduce the issue with: 
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep); after the framework 
instance is killed, the master reports the tasks as running even after several 
minutes: 
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

Clicking on one of the slaves where, for example, task 49 ran shows that the 
slave knows that it completed: 
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

Here is the log of a mesos-local instance where I reproduced it: 
https://gist.github.com/nqn/f7ee20601199d70787c0 (Here tasks 10 to 19 are stuck 
in the running state).
There is a lot of output, so here is a filtered log for task 10: 
https://gist.github.com/nqn/a53e5ea05c5e41cd5a7d

The problem turns out to be an issue with the ack-cycle of status updates:
If the framework disconnects (with a failover timeout set), the status update 
manager on the slaves will keep trying to send the front of the status update 
stream to the master (which in turn forwards it to the framework). If the first 
status update after the disconnect is terminal, things work out fine; the master 
picks the terminal state up, removes the task and releases the resources.
If, on the other hand, a non-terminal status is ahead of it in the stream, the 
master will never know that the task finished (or failed) before the framework 
reconnects.
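
The stuck state can be reproduced with a toy model of the ack-cycle. The Python below is an illustrative simulation only, not Mesos code: StatusUpdateStream and master_view are hypothetical names, and the model assumes that an update is only acknowledged (and so removed from the front of the stream) while the framework is connected.

```python
from collections import deque


class StatusUpdateStream:
    """Per-task queue of status updates held by the slave."""

    def __init__(self, states):
        self.pending = deque(states)

    def front(self):
        return self.pending[0] if self.pending else None

    def acknowledge(self):
        if self.pending:
            self.pending.popleft()


def master_view(stream, framework_connected, retries=3):
    """Return the last state the master sees after `retries` send attempts."""
    seen = None
    for _ in range(retries):
        update = stream.front()
        if update is None:
            break
        seen = update
        # Acks originate from the framework, so none arrive while it is away.
        if framework_connected:
            stream.acknowledge()
    return seen


connected = master_view(
    StatusUpdateStream(["TASK_RUNNING", "TASK_FINISHED"]), True)
disconnected = master_view(
    StatusUpdateStream(["TASK_RUNNING", "TASK_FINISHED"]), False)
print(connected, disconnected)
```

With the framework disconnected, the front of the stream never advances past the non-terminal update, which matches the behaviour reported above.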

During a discussion on the dev mailing list 
(http://mail-archives.apache.org/mod_mbox/mesos-dev/201409.mbox/%3cCADKthhAVR5mrq1s9HXw1BB_XFALXWWxjutp7MV4y3wP-Bh=a...@mail.gmail.com%3e)
 we enumerated a couple of options to solve this problem.

First off, having two ack-cycles, one between masters and slaves and one 
between masters and frameworks, would be ideal. We would be able to replay the 
statuses in order while keeping the master state current. However, this 
requires us to persist the master state in replicated storage.

As a first pass, we can make sure that tasks caught in a running state 
don't hog the cluster when they have completed and the framework is disconnected.

Here is a proof-of-concept to work out of: 
https://github.com/nqn/mesos/tree/niklas/status-update-disconnect/

A new (optional) field has been added to the internal status update message:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/messages/messages.proto#L68

This makes it possible for the status update manager to set the field if the 
latest status was terminal: 
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/slave/status_update_manager.cpp#L501
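
The effect of such an optional field can be sketched as follows. This Python model is an assumption-laden illustration rather than the PoC itself: the field name latest_state, the helper names and the way the master uses the field are all hypothetical.

```python
TERMINAL_STATES = {"TASK_FINISHED", "TASK_FAILED", "TASK_KILLED", "TASK_LOST"}


def front_update(pending):
    """Build the update to (re)send: the front of the stream, plus the
    latest state when that latest state is terminal."""
    if not pending:
        return None
    update = {"state": pending[0]}
    latest = pending[-1]
    if latest in TERMINAL_STATES:
        update["latest_state"] = latest  # hypothetical optional field
    return update


def master_should_release(update):
    """Decide whether the master can release the task's resources."""
    state = update.get("latest_state", update["state"])
    return state in TERMINAL_STATES


stuck = front_update(["TASK_RUNNING", "TASK_FINISHED"])
print(stuck, master_should_release(stuck))
```

The framework would still receive the updates in order once it reconnects; only the master's resource accounting uses the extra field in this sketch.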

I added a test which should highlight the issue as well:
https://github.com/nqn/mesos/blob/niklas/status-update-disconnect/src/tests/fault_tolerance_tests.cpp#L2478

I would love some input on the approach before moving on.
There are rough edges in the PoC which (of course) should be addressed before 
bringing it up for review.

  was:
We have run into a problem that cause tasks which completes, when a framework 
is disconnected and has a fail-over time, to remain in a running state even 
though the tasks actually finishes. This hogs the cluster and gives users a 
inconsistent view of the cluster state. Going to the slave, the task is 
finished. Going to the master, the task is still in a non-terminal state. When 
the scheduler reattaches or the failover timeout expires, the tasks finishes 
correctly. The current workflow of this scheduler has a long fail-over timeout, 
but may on the other hand never reattach.

Here is a test framework we have been able to reproduce the issue with: 
https://gist.github.com/nqn/9b9b1de9123a6e836f54
It launches many short-lived tasks (1 second sleep) and when killing the 
framework instance, the master reports the tasks as running even after several 
minutes: 
http://cl.ly/image/2R3719461e0t/Screen%20Shot%202014-09-10%20at%203.19.39%20PM.png

When clicking on one of the slaves where, for example, task 49 runs; the slave 
knows that it completed: 
http://cl.ly/image/2P410L3m1O1N/Screen%20Shot%202014-09-10%20at%203.21.29%20PM.png

Here is the log of a

[jira] [Created] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults

Vinod Kone created MESOS-1818:
-

 Summary: AllocatorTest/0.ResourcesUnused sometimes segfaults
 Key: MESOS-1818
 URL: https://issues.apache.org/jira/browse/MESOS-1818
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Vinod Kone
Priority: Critical


{code}
[ RUN  ] AllocatorTest/0.ResourcesUnused
*** Aborted at 1411088950 (unix time) try date -d @1411088950 if you are 
using GNU date ***
PC: @   0x8649a4 mesos::SlaveID::value()
*** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 
11753; stack trace: ***
@ 0x7fb643ec4ca0 (unknown)
@   0x8649a4 mesos::SlaveID::value()
@   0x8741c7 mesos::hash_value()
@   0x8f7448 boost::hash::operator()()
@   0x8e0bed 
boost::unordered::detail::mix64_policy::apply_hash()
@ 0x7fb64694c1cf boost::unordered::detail::table::hash()
@ 0x7fb646973615 boost::unordered::detail::table::find_node()
@ 0x7fb64694c191 boost::unordered::detail::table_impl::count()
@ 0x7fb64691f3c1 boost::unordered::unordered_map::count()
@ 0x7fb6468f4373 hashmap::contains()
@ 0x7fb6468c5eda mesos::internal::master::Master::getSlave()
@ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework()
@ 0x7fb6468afa9f mesos::internal::master::Master::unregisterFramework()
@ 0x7fb646904ab9 ProtobufProcess::handler1()
@ 0x7fb6469a1e81 
_ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
@ 0x7fb646983afe std::_Bind::operator()()
@ 0x7fb64695f83c std::_Function_handler::_M_invoke()
@   0xc4e17f std::function::operator()()
@ 0x7fb6468ebd10 ProtobufProcess::visit()
@ 0x7fb6468a9892 mesos::internal::master::Master::_visit()
@ 0x7fb6468a8f46 mesos::internal::master::Master::visit()
@ 0x7fb6468ce670 process::MessageEvent::visit()
@   0x86ad54 process::ProcessBase::serve()
@ 0x7fb6470e9738 process::ProcessManager::resume()
@ 0x7fb6470dff3f process::schedule()
@ 0x7fb643ebc83d start_thread
@ 0x7fb642c2426d clone
make[3]: *** [check-local] Segmentation fault
{code}




[jira] [Commented] (MESOS-1384) Add support for loadable MesosModule


[ 
https://issues.apache.org/jira/browse/MESOS-1384?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14139841#comment-14139841
 ] 

Bernd Mathiske commented on MESOS-1384:
---

OK, JSON it is then. 

 Add support for loadable MesosModule
 

 Key: MESOS-1384
 URL: https://issues.apache.org/jira/browse/MESOS-1384
 Project: Mesos
  Issue Type: Improvement
Affects Versions: 0.19.0
Reporter: Timothy St. Clair
Assignee: Niklas Quarfot Nielsen

 I think we should break this into multiple phases.
 -(1) Let's get the dynamic library loading via a stout-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/DynamicLibrary.h.
  -
 *DONE*
 (2) Use (1) to instantiate some classes in Mesos (like an Authenticator 
 and/or isolator) from a dynamic library. This will give us some more 
 experience with how we want to name the underlying library symbol, how we 
 want to specify flags for finding the library, what types of validation we 
 want when loading a library.
 *TARGET* 
 (3) After doing (2) for one or two classes in Mesos I think we can formalize 
 the approach in a mesos-ified version of 
 https://github.com/timothysc/tests/blob/master/plugin_modules/MesosModule.h.
 *NEXT*




[jira] [Updated] (MESOS-1813) Fail fast in example frameworks if task goes into unexpected state


 [ 
https://issues.apache.org/jira/browse/MESOS-1813?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1813:
--
Sprint: Mesos Q3 Sprint 5

 Fail fast in example frameworks if task goes into unexpected state
 --

 Key: MESOS-1813
 URL: https://issues.apache.org/jira/browse/MESOS-1813
 Project: Mesos
  Issue Type: Improvement
  Components: test
Reporter: Vinod Kone
Assignee: Vinod Kone
 Fix For: 0.21.0


 Most of the example frameworks launch a bunch of tasks and exit if *all* of 
 them reach the FINISHED state. But if there is a bug in the code resulting in 
 TASK_LOST, the framework waits forever. Instead, the framework should abort if 
 an unexpected task state is encountered.




[jira] [Updated] (MESOS-1814) Task attempted to use more offers than requested in example jave and python frameworks


 [ 
https://issues.apache.org/jira/browse/MESOS-1814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1814:
--
Shepherd: Benjamin Mahler  (was: Yan Xu)

 Task attempted to use more offers than requested in example jave and python 
 frameworks
 --

 Key: MESOS-1814
 URL: https://issues.apache.org/jira/browse/MESOS-1814
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Vinod Kone
Assignee: Vinod Kone

 {code}
 [ RUN  ] ExamplesTest.JavaFramework
 Using temporary directory '/tmp/ExamplesTest_JavaFramework_2PcFCh'
 Enabling authentication for the framework
 WARNING: Logging before InitGoogleLogging() is written to STDERR
 I0917 23:14:35.199069 31510 process.cpp:1771] libprocess is initialized on 
 127.0.1.1:34609 for 8 cpus
 I0917 23:14:35.199794 31510 logging.cpp:177] Logging to STDERR
 I0917 23:14:35.225342 31510 leveldb.cpp:176] Opened db in 22.197149ms
 I0917 23:14:35.231133 31510 leveldb.cpp:183] Compacted db in 5.601897ms
 I0917 23:14:35.231498 31510 leveldb.cpp:198] Created db iterator in 215441ns
 I0917 23:14:35.231608 31510 leveldb.cpp:204] Seeked to beginning of db in 
 11488ns
 I0917 23:14:35.231722 31510 leveldb.cpp:273] Iterated through 0 keys in the 
 db in 14016ns
 I0917 23:14:35.231917 31510 replica.cpp:741] Replica recovered with log 
 positions 0 - 0 with 1 holes and 0 unlearned
 I0917 23:14:35.233129 31526 recover.cpp:425] Starting replica recovery
 I0917 23:14:35.233614 31526 recover.cpp:451] Replica is in EMPTY status
 I0917 23:14:35.234994 31526 replica.cpp:638] Replica in EMPTY status received 
 a broadcasted recover request
 I0917 23:14:35.240116 31519 recover.cpp:188] Received a recover response from 
 a replica in EMPTY status
 I0917 23:14:35.240782 31519 recover.cpp:542] Updating replica status to 
 STARTING
 I0917 23:14:35.242846 31524 master.cpp:286] Master 
 20140917-231435-16842879-34609-31503 (saucy) started on 127.0.1.1:34609
 I0917 23:14:35.243191 31524 master.cpp:332] Master only allowing 
 authenticated frameworks to register
 I0917 23:14:35.243288 31524 master.cpp:339] Master allowing unauthenticated 
 slaves to register
 I0917 23:14:35.243399 31524 credentials.hpp:36] Loading credentials for 
 authentication from '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials'
 W0917 23:14:35.243588 31524 credentials.hpp:51] Permissions on credentials 
 file '/tmp/ExamplesTest_JavaFramework_2PcFCh/credentials' are too open. It is 
 recommended that your credentials file is NOT accessible by others.
 I0917 23:14:35.243846 31524 master.cpp:366] Authorization enabled
 I0917 23:14:35.244882 31520 hierarchical_allocator_process.hpp:299] 
 Initializing hierarchical allocator process with master : 
 master@127.0.1.1:34609
 I0917 23:14:35.245224 31520 master.cpp:120] No whitelist given. Advertising 
 offers for all slaves
 I0917 23:14:35.246934 31524 master.cpp:1211] The newly elected leader is 
 master@127.0.1.1:34609 with id 20140917-231435-16842879-34609-31503
 I0917 23:14:35.247234 31524 master.cpp:1224] Elected as the leading master!
 I0917 23:14:35.247336 31524 master.cpp:1042] Recovering from registrar
 I0917 23:14:35.247542 31526 registrar.cpp:313] Recovering registrar
 I0917 23:14:35.250555 31510 containerizer.cpp:89] Using isolation: 
 posix/cpu,posix/mem
 I0917 23:14:35.252326 31510 containerizer.cpp:89] Using isolation: 
 posix/cpu,posix/mem
 I0917 23:14:35.252821 31520 slave.cpp:169] Slave started on 1)@127.0.1.1:34609
 I0917 23:14:35.253552 31520 slave.cpp:289] Slave resources: cpus(*):1; 
 mem(*):1001; disk(*):24988; ports(*):[31000-32000]
 I0917 23:14:35.253906 31520 slave.cpp:317] Slave hostname: saucy
 I0917 23:14:35.254004 31520 slave.cpp:318] Slave checkpoint: true
 I0917 23:14:35.254818 31520 state.cpp:33] Recovering state from 
 '/tmp/mesos-w8snRW/0/meta'
 I0917 23:14:35.255106 31519 leveldb.cpp:306] Persisting metadata (8 bytes) to 
 leveldb took 13.99622ms
 I0917 23:14:35.255235 31519 replica.cpp:320] Persisted replica status to 
 STARTING
 I0917 23:14:35.255419 31519 recover.cpp:451] Replica is in STARTING status
 I0917 23:14:35.255834 31519 replica.cpp:638] Replica in STARTING status 
 received a broadcasted recover request
 I0917 23:14:35.256000 31519 recover.cpp:188] Received a recover response from 
 a replica in STARTING status
 I0917 23:14:35.256217 31519 recover.cpp:542] Updating replica status to VOTING
 I0917 23:14:35.256641 31520 status_update_manager.cpp:193] Recovering status 
 update manager
 I0917 23:14:35.257064 31520 containerizer.cpp:252] Recovering containerizer
 I0917 23:14:35.257725 31520 slave.cpp:3220] Finished recovery
 I0917 23:14:35.258463 31520 slave.cpp:600] New master detected at 
 master@127.0.1.1:34609
 I0917 23:14:35.258769 31524

[jira] [Updated] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults


 [ 
https://issues.apache.org/jira/browse/MESOS-1818?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vinod Kone updated MESOS-1818:
--
Assignee: Benjamin Mahler  (was: Vinod Kone)

 AllocatorTest/0.ResourcesUnused sometimes segfaults
 ---

 Key: MESOS-1818
 URL: https://issues.apache.org/jira/browse/MESOS-1818
 Project: Mesos
  Issue Type: Bug
  Components: test
Affects Versions: 0.21.0
Reporter: Vinod Kone
Assignee: Benjamin Mahler
Priority: Critical

 {code}
 [ RUN  ] AllocatorTest/0.ResourcesUnused
 *** Aborted at 1411088950 (unix time) try date -d @1411088950 if you are 
 using GNU date ***
 PC: @   0x8649a4 mesos::SlaveID::value()
 *** SIGSEGV (@0x2de9) received by PID 20876 (TID 0x7fb63a1c0940) from PID 
 11753; stack trace: ***
 @ 0x7fb643ec4ca0 (unknown)
 @   0x8649a4 mesos::SlaveID::value()
 @   0x8741c7 mesos::hash_value()
 @   0x8f7448 boost::hash::operator()()
 @   0x8e0bed 
 boost::unordered::detail::mix64_policy::apply_hash()
 @ 0x7fb64694c1cf boost::unordered::detail::table::hash()
 @ 0x7fb646973615 boost::unordered::detail::table::find_node()
 @ 0x7fb64694c191 boost::unordered::detail::table_impl::count()
 @ 0x7fb64691f3c1 boost::unordered::unordered_map::count()
 @ 0x7fb6468f4373 hashmap::contains()
 @ 0x7fb6468c5eda mesos::internal::master::Master::getSlave()
 @ 0x7fb6468c0fc3 mesos::internal::master::Master::removeFramework()
 @ 0x7fb6468afa9f 
 mesos::internal::master::Master::unregisterFramework()
 @ 0x7fb646904ab9 ProtobufProcess::handler1()
 @ 0x7fb6469a1e81 
 _ZNSt5_BindIFPFvPN5mesos8internal6master6MasterEMS3_FvRKN7process4UPIDERKNS0_11FrameworkIDEEMNS1_26UnregisterFrameworkMessageEKFSB_vES8_RKSsES4_SD_SG_St12_PlaceholderILi1EESL_ILi26__callIvJS8_SI_EJLm0ELm1ELm2ELm3ELm4T_OSt5tupleIJDpT0_EESt12_Index_tupleIJXspT1_EEE
 @ 0x7fb646983afe std::_Bind::operator()()
 @ 0x7fb64695f83c std::_Function_handler::_M_invoke()
 @   0xc4e17f std::function::operator()()
 @ 0x7fb6468ebd10 ProtobufProcess::visit()
 @ 0x7fb6468a9892 mesos::internal::master::Master::_visit()
 @ 0x7fb6468a8f46 mesos::internal::master::Master::visit()
 @ 0x7fb6468ce670 process::MessageEvent::visit()
 @   0x86ad54 process::ProcessBase::serve()
 @ 0x7fb6470e9738 process::ProcessManager::resume()
 @ 0x7fb6470dff3f process::schedule()
 @ 0x7fb643ebc83d start_thread
 @ 0x7fb642c2426d clone
 make[3]: *** [check-local] Segmentation fault
 {code}




[jira] [Assigned] (MESOS-1818) AllocatorTest/0.ResourcesUnused sometimes segfaults