[jira] [Updated] (MESOS-2424) Components installed by Mesos 0.22.0-rc2 conflict with python-setuptools on CentOS 6
[ https://issues.apache.org/jira/browse/MESOS-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rojas updated MESOS-2424: --- Assignee: (was: Alexander Rojas) Components installed by Mesos 0.22.0-rc2 conflict with python-setuptools on CentOS 6 Key: MESOS-2424 URL: https://issues.apache.org/jira/browse/MESOS-2424 Project: Mesos Issue Type: Bug Affects Versions: 0.22.0 Environment: Package Build Env: CentOS 6.5 Test Env: CentOS 6.5 + python_setuptools + mesos 0.22.0-rc2 package Reporter: Jeremy Lingmann
With Mesos 0.22.0-rc2 we are seeing files installed in /usr/lib/python2.6 which conflict with the upstream python-setuptools package on CentOS 6. This behavior is new with 0.22.0 and is something we were not seeing with the 0.21.1 builds for CentOS 6. Here is the failure with our pre-built package:
{code}
# yum install ./pkg.rpm
Loaded plugins: fastestmirror, presto
Loading mirror speeds from cached hostfile
 * base: mirrors.advancedhosters.com
 * extras: mirror.ash.fastserv.com
 * updates: mirrors.lga7.us.voxel.net
Setting up Install Process
Examining ./pkg.rpm: mesos-0.22.0-0.1.20150228011042.rc2.centos65.x86_64
Marking ./pkg.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package mesos.x86_64 0:0.22.0-0.1.20150228011042.rc2.centos65 will be installed
--> Processing Dependency: subversion for package: mesos-0.22.0-0.1.20150228011042.rc2.centos65.x86_64
--> Running transaction check
---> Package subversion.x86_64 0:1.6.11-12.el6_6 will be installed
--> Processing Dependency: perl(URI) >= 1.17 for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: apr >= 1.3.0 for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: libneon.so.27()(64bit) for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: libaprutil-1.so.0()(64bit) for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: libapr-1.so.0()(64bit) for package: subversion-1.6.11-12.el6_6.x86_64
--> Running transaction check
---> Package apr.x86_64 0:1.3.9-5.el6_2 will be installed
---> Package apr-util.x86_64 0:1.3.9-3.el6_0.1 will be installed
---> Package neon.x86_64 0:0.29.3-3.el6_4 will be installed
--> Processing Dependency: libproxy.so.0()(64bit) for package: neon-0.29.3-3.el6_4.x86_64
--> Processing Dependency: libpakchois.so.0()(64bit) for package: neon-0.29.3-3.el6_4.x86_64
---> Package perl-URI.noarch 0:1.40-2.el6 will be installed
--> Running transaction check
---> Package libproxy.x86_64 0:0.3.0-10.el6 will be installed
--> Processing Dependency: libproxy-python = 0.3.0-10.el6 for package: libproxy-0.3.0-10.el6.x86_64
--> Processing Dependency: libproxy-bin = 0.3.0-10.el6 for package: libproxy-0.3.0-10.el6.x86_64
---> Package pakchois.x86_64 0:0.4-3.2.el6 will be installed
--> Running transaction check
---> Package libproxy-bin.x86_64 0:0.3.0-10.el6 will be installed
---> Package libproxy-python.x86_64 0:0.3.0-10.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package          Arch    Version                                 Repository  Size
================================================================================
Installing:
 mesos            x86_64  0.22.0-0.1.20150228011042.rc2.centos65  /pkg        67 M
Installing for dependencies:
 apr              x86_64  1.3.9-5.el6_2                           base       123 k
 apr-util         x86_64  1.3.9-3.el6_0.1                         base        87 k
 libproxy         x86_64  0.3.0-10.el6                            base        39 k
 libproxy-bin     x86_64  0.3.0-10.el6                            base       9.0 k
 libproxy-python  x86_64  0.3.0-10.el6                            base       9.1 k
 neon             x86_64  0.29.3-3.el6_4                          base       119 k
 pakchois
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353265#comment-14353265 ] Michael Park commented on MESOS-2467: - +1 Dominic. Seems like a clean, simple, working solution to me. A question that I asked myself on this topic is: what format will the JSON take in terms of its fields? I think the answer is to use the same schema as the {{protobuf}} message, which simplifies the implementation since we can just use the {{protobuf::parse}} function in {{stout/protobuf.hpp}}. Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-94) Master and Slave HTTP handlers should have unit tests
[ https://issues.apache.org/jira/browse/MESOS-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-94: --- Labels: http json test twitter (was: http json test) Master and Slave HTTP handlers should have unit tests - Key: MESOS-94 URL: https://issues.apache.org/jira/browse/MESOS-94 Project: Mesos Issue Type: Improvement Components: json api, master, slave, test Reporter: Charles Reiss Labels: http, json, test, twitter The Master and Slave have HTTP handlers which serve their state (mainly for the webui to use). There should be unit tests of these. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2293) Implement the Call endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2293: - Labels: twitter (was: ) Implement the Call endpoint on master - Key: MESOS-2293 URL: https://issues.apache.org/jira/browse/MESOS-2293 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1988) Scheduler driver should not generate TASK_LOST when disconnected from master
[ https://issues.apache.org/jira/browse/MESOS-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1988: - Labels: twitter (was: ) Scheduler driver should not generate TASK_LOST when disconnected from master Key: MESOS-1988 URL: https://issues.apache.org/jira/browse/MESOS-1988 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Labels: twitter Currently, the driver replies to launchTasks() with TASK_LOST if it detects that it is disconnected from the master. After MESOS-1972 lands, this will be the only place where the driver generates TASK_LOST. See MESOS-1972 for more context. This fix is targeted for 0.22.0 to give frameworks time to implement reconciliation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353337#comment-14353337 ] Niklas Quarfot Nielsen commented on MESOS-2419: --- Jörg, any updates on this ticket? Can you reproduce? If not, try to reach out to [~brenden] Slave recovery not recovering tasks --- Key: MESOS-2419 URL: https://issues.apache.org/jira/browse/MESOS-2419 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.22.0, 0.23.0 Reporter: Brenden Matthews Assignee: Joerg Schad Attachments: mesos-chronos.log.gz, mesos.log.gz In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job` which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`. Here's another case, which is for a docker task: {noformat} Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for container f2001064-e076-4978-b764-ed12a5244e78 of executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal task, destroying container: Container
[jira] [Commented] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353227#comment-14353227 ] Vinod Kone commented on MESOS-2289: --- https://docs.google.com/a/twitter.com/document/d/17EjlrEBEvSBllDC6Xu3BjDoKoGosZpJS0k78JRGx134/edit?usp=sharing Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353182#comment-14353182 ] Dominic Hamon commented on MESOS-2467: -- Instead of relying on the first character (which can also be '{' in valid JSON), perhaps we can instead:
- try JSON parsing, and catch failure
- fall back to the old parsing
This also means we can deprecate the old parsing behaviour more easily. Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
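A minimal sketch of the parse-then-fallback strategy suggested in the comment above. The function names and the {{Resources}} stand-in are hypothetical, not the actual Mesos API:
{code}
#include <string>

struct Resources {};  // Stand-in for mesos::Resources; illustration only.

// Illustration-only stubs: a real implementation would invoke an actual JSON
// parser and the existing --resources text parser, respectively.
static bool parseJson(const std::string& input, Resources* result)
{
  (void) input; (void) result;
  return false;  // Pretend the JSON parse failed, to exercise the fallback.
}

static bool parseLegacy(const std::string& input, Resources* result)
{
  (void) input; (void) result;
  return true;  // Pretend the legacy parse succeeded.
}

// The suggested strategy: attempt JSON first and treat failure as a signal
// to fall back, instead of sniffing the first character (which can
// legitimately be '{' as well as '[').
bool parseResources(const std::string& input, Resources* result)
{
  if (parseJson(input, result)) {
    return true;
  }
  return parseLegacy(input, result);
}
{code}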
[jira] [Resolved] (MESOS-2426) Developer Guide improvements
[ https://issues.apache.org/jira/browse/MESOS-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone resolved MESOS-2426. --- Resolution: Fixed Fix Version/s: 0.23.0 commit a0a5f0ae710c88c7d9c5decc8bbbe6c30c2c9048 Author: Aaron Bell aaron.b...@gmail.com Date: Mon Mar 9 10:50:39 2015 -0700 MESOS-2426 developer guide improvements. 1. Add a new line for the user to run 'rbt status' to log into RB. 2. Change 'git co' (technically invalid) to 'git checkout'. Review: https://reviews.apache.org/r/31638 Developer Guide improvements Key: MESOS-2426 URL: https://issues.apache.org/jira/browse/MESOS-2426 Project: Mesos Issue Type: Bug Components: documentation Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor Fix For: 0.23.0 # The docs need to mention that `post-reviews.py` will not work until `rbt status` or similar has been used to log in. The post script actually hangs with no timeout. # `git co` is not a valid command. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2277: - Labels: documentation newbie starter twitter (was: documentation newbie starter) Document undocumented HTTP endpoints Key: MESOS-2277 URL: https://issues.apache.org/jira/browse/MESOS-2277 Project: Mesos Issue Type: Improvement Reporter: Niklas Quarfot Nielsen Priority: Minor Labels: documentation, newbie, starter, twitter Did a quick scan and we are missing documentation for a few endpoints:
{code}
files/browse.json
files/read.json
files/download.json
files/debug.json
master/roles.json
master/state.json
master/stats.json
slave/state.json
slave/stats.json
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2294) Implement the Events endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2294: - Labels: twitter (was: ) Implement the Events endpoint on master --- Key: MESOS-2294 URL: https://issues.apache.org/jira/browse/MESOS-2294 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1127) Implement the protobufs for the scheduler API
[ https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-1127: Assignee: Vinod Kone (was: Benjamin Hindman) Implement the protobufs for the scheduler API - Key: MESOS-1127 URL: https://issues.apache.org/jira/browse/MESOS-1127 Project: Mesos Issue Type: Task Components: framework Reporter: Benjamin Hindman Assignee: Vinod Kone Labels: twitter The default scheduler/executor interface and implementation in Mesos have a few drawbacks: (1) The interface is fairly high-level which makes it hard to do certain things, for example, handle events (callbacks) in batch. This can have a big impact on the performance of schedulers (for example, writing task updates that need to be persisted). (2) The implementation requires writing a lot of boilerplate JNI and native Python wrappers when adding additional API components. The plan is to provide a lower-level API that can easily be used to implement the higher-level API that is currently provided. This will also open the door to more easily building native-language Mesos libraries (i.e., not needing the C++ shim layer) and building new higher-level abstractions on top of the lower-level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2467) Allow --resources flag to take JSON.
Jie Yu created MESOS-2467: - Summary: Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353736#comment-14353736 ] Vinod Kone commented on MESOS-2309: --- [~js84] I don't think it matters if a field has a default value or not. IIUC, every singular field has a default value. For message types, the default is a message with all fields unset (but with default values). Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
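For readers less familiar with proto2 semantics, a small self-contained illustration of the point above, assuming the generated {{mesos::CommandInfo}} class from mesos.proto (where {{shell}} is declared optional with default true):
{code}
#include <mesos/mesos.pb.h>  // Generated proto2 code; assumed available.

#include <cassert>

int main()
{
  mesos::CommandInfo command;

  // An unset optional field still reads as its declared default...
  assert(command.shell() == true);
  // ...but presence is tracked separately from the value:
  assert(!command.has_shell());

  // Setting the field to its default changes presence, not the value.
  command.set_shell(true);
  assert(command.has_shell());
  assert(command.shell() == true);

  return 0;
}
{code}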
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353669#comment-14353669 ] Joerg Schad commented on MESOS-2309: Do you mean optional fields with defaults (as in the case of shell)? For optional fields without defaults I believe we should first check via has_field whether it is actually present. Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353640#comment-14353640 ] Vinod Kone commented on MESOS-2309: --- In our code base, we are almost always interested in functional equivalence and not object equivalence. As such, I think we should fix our == operator overloads for all our protobufs to do equivalence checks of optional fields the same way as we do required fields. I'll prep a review for this shortly. Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
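A sketch of what such a functional-equivalence comparison could look like for one field. This is illustrative only, not the actual Mesos overload:
{code}
#include <mesos/mesos.pb.h>

// Compare optional fields by their effective values, which fall back to the
// declared defaults when unset. Under this check, a CommandInfo with 'shell'
// unset equals one with 'shell' explicitly set to true, unlike a
// presence-sensitive check (left.has_shell() == right.has_shell() && ...).
static bool equivalent(
    const mesos::CommandInfo& left,
    const mesos::CommandInfo& right)
{
  // A complete version would compare the remaining fields the same way.
  return left.value() == right.value() &&
         left.shell() == right.shell();
}
{code}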
[jira] [Closed] (MESOS-2414) Java bindings segfault during framework shutdown
[ https://issues.apache.org/jira/browse/MESOS-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen closed MESOS-2414. - Resolution: Fixed commit e1bae61258bc80bd6006bbb6d4a4fb5c0cc95820 Author: Niklas Nielsen n...@qni.dk Date: Fri Mar 6 17:14:17 2015 -0800 Fixed race in getFieldID helper. The getFieldID helper looks up the java/lang/NoSuchFieldError class and stores it in a static. This turned out to provoke racy behavior with Java 8 when multiple drivers are created (and the class object may have been created by another thread). This patch reverts the 'static' optimization and looks up the class object when exceptions are thrown. Review: https://reviews.apache.org/r/31818 Java bindings segfault during framework shutdown Key: MESOS-2414 URL: https://issues.apache.org/jira/browse/MESOS-2414 Project: Mesos Issue Type: Bug Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen
{code}
I0226 16:39:59.063369 626044928 sched.cpp:831] Stopping framework '20150220-141149-16777343-5050-45194-'
[2015-02-26 16:39:59,063] INFO Driver future completed. Executing optional abdication command. (mesosphere.marathon.MarathonSchedulerService:191)
[2015-02-26 16:39:59,065] INFO Setting framework ID to 20150220-141149-16777343-5050-45194- (mesosphere.marathon.MarathonSchedulerService:75)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000106a266d0, pid=99408, tid=44291
#
# JRE version: Java(TM) SE Runtime Environment (8.0_25-b17) (build 1.8.0_25-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V [libjvm.dylib+0x4266d0] Klass::is_subtype_of(Klass*) const+0x4
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again
#
# An error report file with more information is saved as:
# /Users/corpsi/projects/marathon/hs_err_pid99408.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
Abort trap: 6
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
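The shape of the fix, reconstructed from the commit description above; this is a sketch of the pattern, not the verbatim patch:
{code}
#include <jni.h>

// Race-free variant of the helper: no function-local static jclass that
// multiple driver threads could initialize concurrently. The error class is
// looked up only on the (rare) failure path, at the point of the throw.
jfieldID getFieldID(
    JNIEnv* env,
    jclass clazz,
    const char* name,
    const char* signature)
{
  jfieldID id = env->GetFieldID(clazz, name, signature);
  if (id == NULL) {
    jclass error = env->FindClass("java/lang/NoSuchFieldError");
    if (error != NULL) {
      env->ThrowNew(error, name);
    }
  }
  return id;
}
{code}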
[jira] [Commented] (MESOS-2448) release checklist should include 'update homebrew' for OS X developers
[ https://issues.apache.org/jira/browse/MESOS-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353689#comment-14353689 ] Niklas Quarfot Nielsen commented on MESOS-2448: --- Alright - will do. release checklist should include 'update homebrew' for OS X developers -- Key: MESOS-2448 URL: https://issues.apache.org/jira/browse/MESOS-2448 Project: Mesos Issue Type: Documentation Components: documentation Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor For many developers the {{brew install mesos}} path is the first exposure to Mesos. Currently this is maintained on a best-effort basis by the community, and it is not keeping up. Example: current Mesos compatibility is stuck in this PR: https://github.com/Homebrew/homebrew/pull/37087/ By adding this to the release checklist we can ensure developers can get the latest version easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2468) Update homebrew should be marked as an optional step in the release manager process doc
Niklas Quarfot Nielsen created MESOS-2468: - Summary: Update homebrew should be marked as an optional step in the release manager process doc Key: MESOS-2468 URL: https://issues.apache.org/jira/browse/MESOS-2468 Project: Mesos Issue Type: Documentation Reporter: Niklas Quarfot Nielsen Priority: Trivial The homebrew step should be marked as optional, as it's not owned by the Mesos project. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1023) Replace all static/global variables with non-POD type
[ https://issues.apache.org/jira/browse/MESOS-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353464#comment-14353464 ] Dominic Hamon commented on MESOS-1023: -- commit f780f67717fe0aa25b6870baedd55c43a7017edb (HEAD, origin/master, origin/HEAD, master) Author: Dominic Hamon d...@twitter.com Commit: Dominic Hamon d...@twitter.com Remove static strings from process and split out some source. Review: https://reviews.apache.org/r/30841 Replace all static/global variables with non-POD type - Key: MESOS-1023 URL: https://issues.apache.org/jira/browse/MESOS-1023 Project: Mesos Issue Type: Bug Components: general, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: c++ See http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Static_and_Global_Variables for the background. Real bugs have been seen. For example, in process::ID::generate we have a map<string, int> that can be accessed within the function after exit has been called. I.e., we can try to access the map after it's been destroyed, but before exit has completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
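One standard remedy for this class of bug, per the style guide linked in the description: keep the non-POD static behind a pointer that is intentionally never deleted, so exit-time destruction cannot invalidate it. A sketch (synchronization omitted; the real process::ID::generate also locks, and this is not necessarily the exact fix Mesos applied):
{code}
#include <map>
#include <string>

std::string generate(const std::string& prefix)
{
  // Leaked on purpose: the map is never destroyed, so it stays valid even
  // when this function runs during exit(), after static destructors began.
  static std::map<std::string, int>* prefixes =
    new std::map<std::string, int>();

  int id = ++(*prefixes)[prefix];
  return prefix + "(" + std::to_string(id) + ")";
}
{code}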
[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353394#comment-14353394 ] Bernd Mathiske commented on MESOS-391: -- A simple approach would be to fix the problem right inside paths::createExecutorDirectory():
1. Check how many subdirs the executor parent dir already has.
2. If it is close to the limit, find out which of the existing subdirs are the oldest.
3. Delete one or more of the latter.
4. Now it should be safe to proceed with the mkdir().
We could then either immediately remove the deleted paths from the GC's internal bookkeeping, or we could just make sure that whenever their GC time is up it makes no difference if they are already gone (pre-deleted). Problems with this approach:
a) Recursively deleting a directory may take a while. This operation blocks the slave process's (actor's) progress. This is already a problem with the mkdir() itself, but that one is less likely to take long (although ultimately it might): it is one file operation. In contrast, the number of file system operations for a recursive deletion is in general unknown and could potentially be large.
b) If you are close to the limit and you only remove one subdir, then you may end up doing so again and again for many tasks.
I propose we deal with problem a) by handling the deletion on a different process than the slave process. The GC process is an obvious candidate. In the slave process, we can wait for a future that signals the completion of the deletion. (There are some concurrency issues that we will get to later.) The advantage of this approach is that it is watertight: there is then no way LINK_MAX can be exceeded by executor dirs. On the other hand, maybe speeding up GC for subdirs once a parent dir fills up beyond, say, 3/4 is always fast enough? But how do you know that for sure if the duration of a file deletion is in principle unknown? That said, the two approaches can of course be combined. I would start with the watertight one and add the other one if still so desired. Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. 
As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
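A sketch of the kind of guard discussed in the comment above, using st_nlink (a directory's link count grows with each subdirectory). The helper name and headroom value are hypothetical:
{code}
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Returns true when 'parent' is close enough to its filesystem's
// per-directory link limit that creating more executor subdirectories is
// risky and GC of the oldest run dirs should happen first.
bool nearLinkLimit(const std::string& parent, long headroom = 16)
{
  struct stat s;
  if (::stat(parent.c_str(), &s) != 0) {
    return false;  // Cannot tell; let the subsequent mkdir surface the error.
  }

  long max = ::pathconf(parent.c_str(), _PC_LINK_MAX);
  if (max <= 0) {
    return false;  // No limit reported for this filesystem.
  }

  return static_cast<long>(s.st_nlink) >= max - headroom;
}

// Hypothetical use inside createExecutorDirectory():
//   if (nearLinkLimit(executorsDir)) { /* GC the oldest run dirs first. */ }
{code}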
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353446#comment-14353446 ] Jie Yu commented on MESOS-2467: --- Should --resources be a JSON array (array of Resource) or a JSON object? I think it should be a JSON array. We would need to add support for parsing a JSON array into RepeatedProtobufXXX? Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
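To make the array shape concrete, here is a sketch of parsing such a --resources JSON array. It uses rapidjson and a plain struct purely for illustration; Mesos itself would presumably route this through {{protobuf::parse}} in {{stout/protobuf.hpp}} once repeated-field support exists, and the flat "scalar" field shown here is a simplification of the real Resource layout:
{code}
#include <rapidjson/document.h>

#include <string>
#include <vector>

struct Resource  // Stand-in for the mesos::Resource protobuf.
{
  std::string name;
  double scalar;
};

bool parseResourcesJson(const std::string& input, std::vector<Resource>* out)
{
  rapidjson::Document doc;
  if (doc.Parse(input.c_str()).HasParseError() || !doc.IsArray()) {
    return false;  // Not valid JSON, or not the expected top-level array.
  }

  for (rapidjson::SizeType i = 0; i < doc.Size(); ++i) {
    const rapidjson::Value& element = doc[i];
    if (!element.IsObject() ||
        !element.HasMember("name") || !element["name"].IsString() ||
        !element.HasMember("scalar") || !element["scalar"].IsNumber()) {
      return false;  // Each element must look like one Resource.
    }

    Resource resource;
    resource.name = element["name"].GetString();
    resource.scalar = element["scalar"].GetDouble();
    out->push_back(resource);
  }

  return true;
}
{code}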
[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353480#comment-14353480 ] Ritwik Yadav commented on MESOS-391: That IMO is an excellent solution. However, what I fail to understand is the comparison between these two scenarios:
1) The slave process itself deleting a directory recursively.
2) The slave process waiting on the GC process, which deletes a directory recursively.
The latter might be a good design option, but does it affect run time in any way? Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2419) Slave recovery not recovering tasks
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353349#comment-14353349 ] Joerg Schad edited comment on MESOS-2419 at 3/9/15 6:33 PM: So far I can only reproduce this on the cluster. Currently reproducing the environment of the cluster (especially cgroups setup), already in contact with Brenden. was (Author: js84): So far I can only reproduce this on the cluster. Currently reproducing the environment of the clustert (especially cgroups setup), already in contact with Brenden. Slave recovery not recovering tasks --- Key: MESOS-2419 URL: https://issues.apache.org/jira/browse/MESOS-2419 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.22.0, 0.23.0 Reporter: Brenden Matthews Assignee: Joerg Schad Attachments: mesos-chronos.log.gz, mesos.log.gz In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job` which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`. Here's another case, which is for a docker task: {noformat} Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093 10023 slave.cpp:2637] Failed to update
[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353481#comment-14353481 ] Ritwik Yadav commented on MESOS-391: That IMO is an excellent solution. However, what I fail to understand is the comparison between these two scenarios:
1) The slave process itself deleting a directory recursively.
2) The slave process waiting on the GC process, which deletes a directory recursively.
The latter might be a good design option, but does it affect run time in any way? Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ritwik Yadav updated MESOS-391: --- Comment: was deleted (was: That IMO is an excellent solution. However, what I fail to understand is the comparison between these two scenarios: 1) The slave process itself deleting a directory recursively. 2) The slave process waiting on the GC process, which deletes a directory recursively. The latter might be a good design option, but does it affect run time in any way?) Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353839#comment-14353839 ] Joerg Schad commented on MESOS-2309: Just tested and you are right :-). Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2108) Add configure flag or environment variable to enable SSL/libevent Socket
[ https://issues.apache.org/jira/browse/MESOS-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2108: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add configure flag or environment variable to enable SSL/libevent Socket Key: MESOS-2108 URL: https://issues.apache.org/jira/browse/MESOS-2108 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2110) Configurable Ping Timeouts
[ https://issues.apache.org/jira/browse/MESOS-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2110: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Configurable Ping Timeouts -- Key: MESOS-2110 URL: https://issues.apache.org/jira/browse/MESOS-2110 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Adam B Assignee: Adam B Labels: master, network, slave, timeout After a series of ping-failures, the master considers the slave lost and calls shutdownSlave, requiring such a slave that reconnects to kill its tasks and re-register as a new slaveId. On the other side, after a similar timeout, the slave will consider the master lost and try to detect a new master. These timeouts are currently hardcoded constants (5 * 15s), which may not be well-suited for all scenarios.
- Some clusters may tolerate a longer slave process restart period, and wouldn't want tasks to be killed upon reconnect.
- Some clusters may have higher-latency networks (e.g. cross-datacenter, or for volunteer computing efforts), and would like to tolerate longer periods without communication.
We should provide flags/mechanisms on the master to control its tolerance for non-communicative slaves, and (less importantly?) on the slave to tolerate missing masters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
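A sketch of how such tolerances might be exposed as master flags using stout's flags library. The flag names and defaults here are illustrative (mirroring the current hardcoded 5 attempts x 15s), not a committed interface:
{code}
#include <stout/duration.hpp>
#include <stout/flags.hpp>

class Flags : public virtual flags::FlagsBase
{
public:
  Flags()
  {
    add(&Flags::slave_ping_timeout,
        "slave_ping_timeout",
        "Duration to wait for a ping response from a slave before\n"
        "counting it as a failed ping attempt.",
        Seconds(15));  // Matches the current hardcoded interval.

    add(&Flags::max_slave_ping_timeouts,
        "max_slave_ping_timeouts",
        "Number of consecutive failed pings after which the master\n"
        "considers the slave lost.",
        5u);  // Matches the current hardcoded attempt count.
  }

  Duration slave_ping_timeout;
  size_t max_slave_ping_timeouts;
};
{code}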
[jira] [Updated] (MESOS-2351) Enable label and environment decorators (hooks) to remove label and environment entries
[ https://issues.apache.org/jira/browse/MESOS-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2351: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Enable label and environment decorators (hooks) to remove label and environment entries --- Key: MESOS-2351 URL: https://issues.apache.org/jira/browse/MESOS-2351 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen We need to change the semantics of decorators to be able to not only add labels and environment variables, but also remove them. The change is fairly small. The hook manager (and call site) uses CopyFrom instead of MergeFrom, and hook implementors pass on the labels and environment from the task and executor commands, respectively. In the future, we can tag labels such that only labels belonging to a hook type (across master and slave) can be inspected and changed. For now, the active hooks are selected by the operator and can therefore be trusted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
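The protobuf-level difference this hinges on, in a short sketch (assuming the generated {{mesos::Labels}} message; this is not the actual hook manager code):
{code}
#include <mesos/mesos.pb.h>

// MergeFrom() appends to repeated fields, so a decorator's output could only
// ever *add* labels. CopyFrom() replaces the target wholesale, so a decorator
// that returns a label set with entries omitted effectively removes them.
void applyDecoratorResult(
    const mesos::Labels& decorated,
    mesos::Labels* target)
{
  // Old semantics (removal impossible):
  //   target->MergeFrom(decorated);

  // New semantics: the decorator's returned labels are authoritative.
  target->CopyFrom(decorated);
}
{code}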
[jira] [Updated] (MESOS-2074) Fetcher cache test fixture
[ https://issues.apache.org/jira/browse/MESOS-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2074: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Fetcher cache test fixture -- Key: MESOS-2074 URL: https://issues.apache.org/jira/browse/MESOS-2074 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 72h Remaining Estimate: 72h To accelerate providing good test coverage for the fetcher cache (MESOS-336), we can provide a framework that canonicalizes creating and running a number of tasks and allows easy parametrization with combinations of the following:
- whether to cache or not
- whether to make what has been downloaded executable or not
- whether to extract from an archive or not
- whether to download from a file system, http, or...
We can create a simple HTTP server in the test fixture to support the latter. Furthermore, the tests need to be robust wrt. varying numbers of StatusUpdate messages. An accumulating update message sink that reports the final state is needed. All this has already been programmed in this patch; it just needs to be rebased: https://reviews.apache.org/r/21316/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
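A sketch of the parametrization idea using googletest's value-parameterized tests, assuming a gtest build with std::tuple support; the fixture body is a placeholder, not the patch under review:
{code}
#include <gtest/gtest.h>

#include <tuple>

// One fixture instantiated over all combinations of
// (cache?, make executable?, extract from archive?).
class FetcherCacheTest
  : public ::testing::TestWithParam<std::tuple<bool, bool, bool>> {};

TEST_P(FetcherCacheTest, RunsTask)
{
  bool cache, executable, extract;
  std::tie(cache, executable, extract) = GetParam();

  // A real test would configure a CommandInfo::URI from these knobs, run a
  // task, and assert on the accumulated status updates.
  SUCCEED();
}

INSTANTIATE_TEST_CASE_P(
    AllCombinations,
    FetcherCacheTest,
    ::testing::Combine(
        ::testing::Bool(),    // cache
        ::testing::Bool(),    // executable
        ::testing::Bool()));  // extract
{code}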
[jira] [Updated] (MESOS-1913) Create libevent/SSL-backed Socket implementation
[ https://issues.apache.org/jira/browse/MESOS-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1913: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Create libevent/SSL-backed Socket implementation Key: MESOS-1913 URL: https://issues.apache.org/jira/browse/MESOS-1913 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2155) Make docker containerizer killing orphan containers optional
[ https://issues.apache.org/jira/browse/MESOS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2155: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Make docker containerizer killing orphan containers optional Key: MESOS-2155 URL: https://issues.apache.org/jira/browse/MESOS-2155 Project: Mesos Issue Type: Improvement Components: docker Reporter: Timothy Chen Assignee: Timothy Chen Currently the docker containerizer, on recovery, will kill containers that it does not recognize. We want to make this behavior optional, as there are certain situations in which we want to let the docker containers continue to run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2226) HookTest.VerifySlaveLaunchExecutorHook is flaky
[ https://issues.apache.org/jira/browse/MESOS-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2226: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) HookTest.VerifySlaveLaunchExecutorHook is flaky --- Key: MESOS-2226 URL: https://issues.apache.org/jira/browse/MESOS-2226 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Vinod Kone Assignee: Kapil Arya Labels: flaky-test Observed this on internal CI {code} [ RUN ] HookTest.VerifySlaveLaunchExecutorHook Using temporary directory '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME' I0114 18:51:34.659353 4720 leveldb.cpp:176] Opened db in 1.255951ms I0114 18:51:34.662112 4720 leveldb.cpp:183] Compacted db in 596090ns I0114 18:51:34.662364 4720 leveldb.cpp:198] Created db iterator in 177877ns I0114 18:51:34.662719 4720 leveldb.cpp:204] Seeked to beginning of db in 19709ns I0114 18:51:34.663010 4720 leveldb.cpp:273] Iterated through 0 keys in the db in 18208ns I0114 18:51:34.663312 4720 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0114 18:51:34.664266 4735 recover.cpp:449] Starting replica recovery I0114 18:51:34.664908 4735 recover.cpp:475] Replica is in EMPTY status I0114 18:51:34.667842 4734 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0114 18:51:34.669117 4735 recover.cpp:195] Received a recover response from a replica in EMPTY status I0114 18:51:34.677913 4735 recover.cpp:566] Updating replica status to STARTING I0114 18:51:34.683157 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 137939ns I0114 18:51:34.683507 4735 replica.cpp:323] Persisted replica status to STARTING I0114 18:51:34.684013 4735 recover.cpp:475] Replica is in STARTING status I0114 18:51:34.685554 4738 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0114 18:51:34.696512 4736 recover.cpp:195] Received a recover response from a replica in STARTING status I0114 18:51:34.700552 4735 recover.cpp:566] Updating replica status to VOTING I0114 18:51:34.701128 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 115624ns I0114 18:51:34.701478 4735 replica.cpp:323] Persisted replica status to VOTING I0114 18:51:34.701817 4735 recover.cpp:580] Successfully joined the Paxos group I0114 18:51:34.702569 4735 recover.cpp:464] Recover process terminated I0114 18:51:34.716439 4736 master.cpp:262] Master 20150114-185134-2272962752-57018-4720 (fedora-19) started on 192.168.122.135:57018 I0114 18:51:34.716913 4736 master.cpp:308] Master only allowing authenticated frameworks to register I0114 18:51:34.717136 4736 master.cpp:313] Master only allowing authenticated slaves to register I0114 18:51:34.717488 4736 credentials.hpp:36] Loading credentials for authentication from '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME/credentials' I0114 18:51:34.718077 4736 master.cpp:357] Authorization enabled I0114 18:51:34.719238 4738 whitelist_watcher.cpp:65] No whitelist given I0114 18:51:34.719755 4737 hierarchical_allocator_process.hpp:285] Initialized hierarchical allocator process I0114 18:51:34.722584 4736 master.cpp:1219] The newly elected leader is master@192.168.122.135:57018 with id 20150114-185134-2272962752-57018-4720 I0114 18:51:34.722865 4736 master.cpp:1232] Elected as 
the leading master! I0114 18:51:34.723310 4736 master.cpp:1050] Recovering from registrar I0114 18:51:34.723760 4734 registrar.cpp:313] Recovering registrar I0114 18:51:34.725229 4740 log.cpp:660] Attempting to start the writer I0114 18:51:34.727893 4739 replica.cpp:477] Replica received implicit promise request with proposal 1 I0114 18:51:34.728425 4739 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 114781ns I0114 18:51:34.728662 4739 replica.cpp:345] Persisted promised to 1 I0114 18:51:34.731271 4741 coordinator.cpp:230] Coordinator attemping to fill missing position I0114 18:51:34.733223 4734 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0114 18:51:34.734076 4734 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 87441ns I0114 18:51:34.734441 4734 replica.cpp:679] Persisted action at 0 I0114 18:51:34.740272 4739 replica.cpp:511] Replica received write request for position 0 I0114 18:51:34.740910 4739 leveldb.cpp:438] Reading position from leveldb took 59846ns I0114 18:51:34.741672 4739
[jira] [Updated] (MESOS-2160) Add support for allocator modules
[ https://issues.apache.org/jira/browse/MESOS-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2160: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add support for allocator modules - Key: MESOS-2160 URL: https://issues.apache.org/jira/browse/MESOS-2160 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Alexander Rukletsov Labels: mesosphere Currently Mesos supports only the DRF allocator; changing it requires hacking the Mesos source code, which sets a high entry barrier. Allocator modules make it easy to tweak the resource allocation policy: they enable swapping allocation policies without editing the Mesos source code. Custom allocators can be written by anyone and do not need to be distributed together with Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
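A hypothetical sketch of what such a module could look like (the header paths, namespaces, and Module<> signature below are assumptions for illustration, not the final interface): the allocation policy lives in a loadable library rather than in the Mesos source tree.
{code}
// Sketch only: module API names are assumed for illustration.
#include <mesos/module.hpp>            // assumed header
#include <mesos/module/allocator.hpp>  // assumed header

static mesos::master::allocator::Allocator* createCustomAllocator(
    const mesos::Parameters& parameters)
{
  // Construct and return the custom policy here (in place of built-in DRF).
  return NULL;  // placeholder in this sketch
}

// Registering the module makes the policy loadable at master startup.
mesos::modules::Module<mesos::master::allocator::Allocator>
  org_example_CustomAllocator(
      MESOS_MODULE_API_VERSION,
      MESOS_VERSION,
      "Example Org",
      "dev@example.org",
      "Example allocator module",
      NULL,                    // compatibility callback
      createCustomAllocator);
{code}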
[jira] [Updated] (MESOS-2057) Concurrency control for fetcher cache
[ https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2057: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Concurrency control for fetcher cache - Key: MESOS-2057 URL: https://issues.apache.org/jira/browse/MESOS-2057 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 96h Remaining Estimate: 96h MESOS-2069 added a flag to CommandInfo URIs that indicates that files downloaded by the fetcher shall be cached in a repository. Now ensure that when a URI is cached, it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This holds even if multiple tasks request the same URI concurrently. If multiple requests for the same URI occur, perform only one of them and reuse the result. Make concurrent requests for the same URI wait for the one download. Different URIs from different CommandInfos can be downloaded concurrently. No cache eviction, cleanup or failover will be handled for now; additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Note that implementing this does not suffice for production use. This ticket contains the main part of the fetcher logic, though. See the epic MESOS-336 for the rest of the features that lead to a fully functional fetcher cache. The proposed general approach is to keep all bookkeeping about what is in which stage of being fetched and where it resides in the slave's MesosContainerizerProcess, so that all concurrent access is disambiguated and controlled by an actor (aka libprocess process). Depends on MESOS-2056 and MESOS-2069. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
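The waiting scheme described above is the classic single-flight pattern. A standalone sketch (not the actual Mesos code, which routes this through a libprocess actor; the download() helper here is hypothetical): the first request for a URI starts the download and publishes a shared future, and later requests for the same URI wait on that future instead of fetching again.
{code}
#include <future>
#include <map>
#include <mutex>
#include <string>

// Hypothetical helper standing in for the real fetch; returns the path of
// the downloaded file in the cache.
std::string download(const std::string& uri) { return "/cache/" + uri; }

std::mutex mutex;
std::map<std::string, std::shared_future<std::string>> downloads;  // URI -> path

std::string fetch(const std::string& uri)
{
  std::shared_future<std::string> future;
  {
    std::lock_guard<std::mutex> lock(mutex);
    auto it = downloads.find(uri);
    if (it == downloads.end()) {
      // First request for this URI: start the one and only download.
      it = downloads.emplace(
          uri,
          std::async(std::launch::async,
                     [uri]() { return download(uri); }).share()).first;
    }
    future = it->second;  // later requests share the same pending result
  }
  return future.get();  // all concurrent requesters wait on the one download
}
{code}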
[jira] [Updated] (MESOS-2157) Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints
[ https://issues.apache.org/jira/browse/MESOS-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2157: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints Key: MESOS-2157 URL: https://issues.apache.org/jira/browse/MESOS-2157 Project: Mesos Issue Type: Task Components: master Reporter: Niklas Quarfot Nielsen Assignee: Alexander Rojas Priority: Trivial Labels: mesosphere, newbie master/state.json exports the entire state of the cluster and can, for large clusters, become massive (tens of megabytes of JSON). Often, a client only needs information about subsets of the entire state, for example all connected slaves, or information (registration info, tasks, etc.) belonging to a particular framework. We can partition state.json into many smaller endpoints, but for starters, being able to get slave information and task information per framework would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2119) Add Socket tests
[ https://issues.apache.org/jira/browse/MESOS-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2119: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add Socket tests Key: MESOS-2119 URL: https://issues.apache.org/jira/browse/MESOS-2119 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere Add more Socket-specific tests to get coverage while doing the libev-to-libevent move (with and without SSL) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2373) DRFSorter needs to distinguish resources from different slaves.
[ https://issues.apache.org/jira/browse/MESOS-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2373: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) DRFSorter needs to distinguish resources from different slaves. --- Key: MESOS-2373 URL: https://issues.apache.org/jira/browse/MESOS-2373 Project: Mesos Issue Type: Bug Components: allocation Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Currently the {{DRFSorter}} aggregates total and allocated resources across multiple slaves, which only works for scalar resources. We need to distinguish resources from different slaves. Suppose we have 2 slaves and 1 framework. The framework is allocated all resources from both slaves. {code} Resources slaveResources = Resources::parse("cpus:2;mem:512;ports:[31000-32000]").get(); DRFSorter sorter; sorter.add(slaveResources); // Add slave1 resources sorter.add(slaveResources); // Add slave2 resources // Total resources in sorter at this point is // cpus(*):4; mem(*):1024; ports(*):[31000-32000]. // The scalar resources get aggregated correctly but ports do not. sorter.add("F"); // The 2 calls to allocated below only work because we simply do: // allocation[name] += resources; // without checking that the 'resources' is available in the total. sorter.allocated("F", slaveResources); sorter.allocated("F", slaveResources); // At this point, sorter.allocation("F") is: // cpus(*):4; mem(*):1024; ports(*):[31000-32000]. {code} To provide some context, this issue came up while trying to reserve all unreserved resources from every offer. {code} for (const Offer& offer : offers) { Resources unreserved = offer.resources().unreserved(); Resources reserved = unreserved.flatten(role, Resource::FRAMEWORK); Offer::Operation reserve; reserve.set_type(Offer::Operation::RESERVE); reserve.mutable_reserve()->mutable_resources()->CopyFrom(reserved); driver->acceptOffers({offer.id()}, {reserve}); } {code} Suppose the slave resources are the same as above: {quote} Slave1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} Slave2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} {quote} The initial (incorrect) total resources in the DRFSorter are: {quote} {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}} {quote} We receive 2 offers, 1 from each slave: {quote} Offer1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} Offer2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} {quote} At this point, the resources allocated for the framework are: {quote} {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}} {quote} After the first {{RESERVE}} operation with Offer1, the allocated resources for the framework become: {quote} {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; ports(role):\[31000-32000\]}} {quote} During the second {{RESERVE}} operation with Offer2: {code:title=HierarchicalAllocatorProcess::updateAllocation} // ... FrameworkSorter* frameworkSorter = frameworkSorters[frameworks[frameworkId].role]; Resources allocation = frameworkSorter->allocation(frameworkId.value()); // Update the allocated resources. Try<Resources> updatedAllocation = allocation.apply(operations); CHECK_SOME(updatedAllocation); // ... 
{code} {{allocation}} in the above code is: {quote} {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; ports(role):\[31000-32000\]}} {quote} We try to {{apply}} a {{RESERVE}} operation and we fail to find {{ports(\*):\[31000-32000\]}}, which leads to the {{CHECK}} failure at {{CHECK_SOME(updatedAllocation);}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
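The direction of a fix is to stop merging per-slave totals into one pool. A standalone sketch of that idea (types deliberately simplified; the real sorter would key its bookkeeping by SlaveID and use a proper range type):
{code}
#include <map>
#include <string>

// Simplified stand-ins for illustration only.
struct SlaveTotal
{
  double cpus;
  double mem;
  std::string ports;  // e.g. "[31000-32000]"
};

// Keyed by slave: "identical" port ranges from two slaves remain two distinct
// entries instead of collapsing into one (the bug described above).
std::map<std::string, SlaveTotal> totals;

void addSlave(const std::string& slaveId, const SlaveTotal& total)
{
  totals[slaveId] = total;  // no cross-slave aggregation of ranges
}
{code}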
[jira] [Updated] (MESOS-2248) 0.22.0 release
[ https://issues.apache.org/jira/browse/MESOS-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2248: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) 0.22.0 release -- Key: MESOS-2248 URL: https://issues.apache.org/jira/browse/MESOS-2248 Project: Mesos Issue Type: Epic Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen Mesos release 0.22.0 will include the following major feature(s): - Module Hooks (MESOS-2060) - Disk quota isolation in Mesos containerizer (MESOS-1587 and MESOS-1588) Minor features and fixes: - Task labels (MESOS-2120) - Service discovery info for tasks and executors (MESOS-2208) - Docker containerizer able to recover when running in a container (MESOS-2115) - Containerizer fixes (...) - Various bug fixes (...) Possible major features: - Container level network isolation (MESOS-1585) - Dynamic Reservations (MESOS-2018) This ticket will be used to track blockers to this release. For reference (per Jan 22nd) this has gone into Mesos since 0.21.1: https://gist.github.com/nqn/76aeb41a555625659ed8 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1831) Master should send PingSlaveMessage instead of PING
[ https://issues.apache.org/jira/browse/MESOS-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1831: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Master should send PingSlaveMessage instead of PING - Key: MESOS-1831 URL: https://issues.apache.org/jira/browse/MESOS-1831 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Adam B Labels: mesosphere In 0.21.0 master sends PING message with an embedded PingSlaveMessage for backwards compatibility (https://reviews.apache.org/r/25867/). In 0.22.0, master should send PingSlaveMessage directly instead of PING. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2215: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks. --- Key: MESOS-2215 URL: https://issues.apache.org/jira/browse/MESOS-2215 Project: Mesos Issue Type: Bug Components: docker Affects Versions: 0.21.0 Reporter: Steve Niemitz Assignee: Timothy Chen Once the slave restarts and recovers the tasks, I see this error in the log, every second or so, for all tasks that were recovered. Note, these were NOT docker tasks: W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21 However, the tasks themselves are still healthy and running. The slave was launched with --containerizers=mesos,docker - More info: it looks like the docker containerizer is a little too ambitious about recovering containers; again, this was not a docker task: I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' of framework 20150109-161713-715350282-5050-290797- Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it should recover a given task, i.e., whether it was the one that launched it. Perhaps this needs to be written into the checkpoint somewhere? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
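A minimal sketch of the suggestion in the last paragraph (the checkpoint layout shown is assumed, not the actual Mesos format): record at launch time which containerizer owns the run, and have each containerizer skip foreign entries during recovery.
{code}
#include <string>

struct CheckpointedRun
{
  std::string containerId;
  std::string containerizer;  // e.g. "mesos" or "docker", written at launch time
};

// Each containerizer recovers only runs it launched itself, instead of,
// e.g., running "docker inspect" on every recovered container id.
bool shouldRecover(const CheckpointedRun& run, const std::string& self)
{
  return run.containerizer == self;
}
{code}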
[jira] [Updated] (MESOS-2072) Fetcher cache eviction
[ https://issues.apache.org/jira/browse/MESOS-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2072: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Fetcher cache eviction -- Key: MESOS-2072 URL: https://issues.apache.org/jira/browse/MESOS-2072 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 336h Remaining Estimate: 336h Delete files from the fetcher cache so that a given cache size is never exceeded. Succeed in doing so while concurrent downloads are on their way and new requests are pouring in. Idea: measure the size of each download before it begins, and make enough room before the download starts. This means that only download mechanisms that divulge the size before the main download will be supported. As far as we know, those in use so far have this property. The calculation of how much space to free needs to be under concurrency control, accumulating all space needed for competing, incomplete download requests. (The Python script that performs fetcher caching for Aurora does not seem to implement this. See https://gist.github.com/zmanji/f41df77510ef9d00265a; imagine several of these programs running concurrently, each one's _cache_eviction() call succeeding, each perceiving the SAME free space as being available.) Ultimately, a conflict resolution strategy is needed if just the downloads underway already exceed the cache capacity. Then, as a fallback, direct download into the work directory will be used for some tasks. TBD how to pick which task gets treated how. At first, only support copying of any downloaded files to the work directory for task execution. This isolates the task life cycle after starting a task from cache eviction considerations. (Later, we can add symbolic links that avoid copying. But then eviction of fetched files used by ongoing tasks must be blocked, which adds complexity. Another future extension is MESOS-1667, Extract from URI while downloading into work dir.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
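A standalone sketch of the space-reservation idea under one lock (assuming sizes are known up front; the eviction helper is stubbed): every admitted-but-incomplete download counts against capacity, so concurrent admissions cannot all perceive the same free space, which is exactly the flaw described above.
{code}
#include <cstdint>
#include <mutex>
#include <stdexcept>

class CacheSpace
{
public:
  explicit CacheSpace(uint64_t capacity) : capacity(capacity), reserved(0) {}

  // Called before a download begins; accumulates space for all competing,
  // incomplete downloads under one lock.
  void reserve(uint64_t size)
  {
    std::lock_guard<std::mutex> lock(mutex);
    while (reserved + size > capacity) {
      if (!evictOneIdleEntry()) {
        // Conflict resolution fallback described above: direct download
        // into the work directory instead of the cache.
        throw std::runtime_error("cache full");
      }
    }
    reserved += size;  // stays claimed until the entry is later evicted
  }

private:
  // Stub: a real version would drop an LRU entry and decrease 'reserved'.
  bool evictOneIdleEntry() { return false; }

  std::mutex mutex;
  uint64_t capacity;
  uint64_t reserved;
};
{code}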
[jira] [Updated] (MESOS-2069) Basic fetcher cache functionality
[ https://issues.apache.org/jira/browse/MESOS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2069: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Basic fetcher cache functionality - Key: MESOS-2069 URL: https://issues.apache.org/jira/browse/MESOS-2069 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Labels: fetcher, slave Original Estimate: 48h Remaining Estimate: 48h Add a flag to CommandInfo URI protobufs that indicates that files downloaded by the fetcher shall be cached in a repository. To be followed by MESOS-2057 for concurrency control. Also see MESOS-336 for the overall goals for the fetcher cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
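For illustration, setting such a flag through the C++ protobuf API might look as follows (the cache field name and the include path are assumptions; value and extract are existing URI fields):
{code}
#include <mesos/mesos.pb.h>  // assumed include path

void markCacheable(mesos::CommandInfo* command)
{
  mesos::CommandInfo::URI* uri = command->add_uris();
  uri->set_value("http://example.com/assets/archive.tgz");
  uri->set_extract(true);  // existing behavior: unpack archives
  uri->set_cache(true);    // the flag proposed here (name assumed)
}
{code}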
[jira] [Updated] (MESOS-2337) __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos
[ https://issues.apache.org/jira/browse/MESOS-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2337: -- Sprint: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 4 - 3/6) __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos -- Key: MESOS-2337 URL: https://issues.apache.org/jira/browse/MESOS-2337 Project: Mesos Issue Type: Bug Components: build, python api Reporter: Kapil Arya Assignee: Kapil Arya Priority: Blocker When doing a make install, the src/python/native/src/mesos/__init__.py file is not getting installed in ${PREFIX}/lib/pythonX.Y/site-packages/mesos/. This makes it impossible to do the following import when PYTHONPATH is set to the site-packages directory. {code} import mesos.interface.mesos_pb2 {code} The directories `${PREFIX}/lib/pythonX.Y/site-packages/mesos/{interface,native}/` do have their corresponding `__init__.py` files. Reproducing the bug: ../configure --prefix=$HOME/test-install && make install -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2070) Implement simple slave recovery behavior for fetcher cache
[ https://issues.apache.org/jira/browse/MESOS-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2070: -- Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Implement simple slave recovery behavior for fetcher cache -- Key: MESOS-2070 URL: https://issues.apache.org/jira/browse/MESOS-2070 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Labels: newbie Original Estimate: 6h Remaining Estimate: 6h Clean the fetcher cache completely upon slave restart/recovery. This implements correct, albeit not ideal behavior. More efficient schemes that restore knowledge about cached files or even resume downloads can be added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
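A minimal sketch of that behavior using stout helpers (the hook point and the directory argument are assumptions): wipe the cache directory during recovery so the slave starts from a clean slate.
{code}
#include <string>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/os.hpp>
#include <stout/try.hpp>

Try<Nothing> recoverFetcherCache(const std::string& cacheDirectory)
{
  if (os::exists(cacheDirectory)) {
    Try<Nothing> rmdir = os::rmdir(cacheDirectory);  // recursive removal
    if (rmdir.isError()) {
      return Error("Failed to clear fetcher cache: " + rmdir.error());
    }
  }
  return Nothing();
}
{code}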
[jira] [Updated] (MESOS-2016) docker_name_prefix is too generic
[ https://issues.apache.org/jira/browse/MESOS-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2016: -- Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) docker_name_prefix is too generic - Key: MESOS-2016 URL: https://issues.apache.org/jira/browse/MESOS-2016 Project: Mesos Issue Type: Bug Reporter: Jay Buffington Assignee: Timothy Chen From docker.hpp and docker.cpp: {quote} // Prefix used to name Docker containers in order to distinguish those // created by Mesos from those created manually. extern std::string DOCKER_NAME_PREFIX; // TODO(benh): At some point to run multiple slaves we'll need to make // the Docker container name creation include the slave ID. string DOCKER_NAME_PREFIX = "mesos-"; {quote} This name is too generic. A common pattern in docker land is to run everything in a container and use volume mounts to share sockets and do RPC between containers. CoreOS has popularized this technique. Inevitably, what people do is start a container named "mesos-slave", which runs the docker containerizer recovery code, which removes all containers that start with "mesos-". And then they ask: "huh, why did my mesos-slave docker container die? I don't see any error messages..." Ideally, we should do what Ben suggested and add the slave id to the name prefix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
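A minimal sketch of Ben's suggestion (the exact naming scheme is an assumption): include the slave ID in the prefix so recovery only matches containers this slave created, not anything that merely starts with "mesos-".
{code}
#include <string>

std::string dockerNamePrefix(const std::string& slaveId)
{
  return "mesos-" + slaveId + "-";  // slave ID value illustrative
}

bool ownedByThisSlave(const std::string& containerName, const std::string& slaveId)
{
  const std::string prefix = dockerNamePrefix(slaveId);
  return containerName.compare(0, prefix.size(), prefix) == 0;
}
{code}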
[jira] [Updated] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1806: -- Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Substituting etcd or ReplicatedLog for Zookeeper Key: MESOS-1806 URL: https://issues.apache.org/jira/browse/MESOS-1806 Project: Mesos Issue Type: Task Reporter: Ed Ropple Assignee: Cody Maloney Priority: Minor adam_mesos eropple: Could you also file a new JIRA for Mesos to drop ZK in favor of etcd or ReplicatedLog? Would love to get some momentum going on that one. -- Consider it filed. =) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2050) InMemoryAuxProp plugin used by Authenticators results in SEGFAULT
[ https://issues.apache.org/jira/browse/MESOS-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2050: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) InMemoryAuxProp plugin used by Authenticators results in SEGFAULT - Key: MESOS-2050 URL: https://issues.apache.org/jira/browse/MESOS-2050 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Reporter: Vinod Kone Assignee: Till Toenshoff Observed this on ASF CI: Basically, as part of the recent Auth refactor for modules, the loading of secrets is being done once per Authenticator Process instead of once in the Master. Since the InMemoryAuxProp plugin manipulates static variables (e.g., 'properties'), it results in a SEGFAULT when one Authenticator (e.g., for the slave) does load() while another Authenticator (e.g., for a framework) does lookup(), as both these methods manipulate the static 'properties'. {code} [ RUN ] MasterTest.LaunchDuplicateOfferTest Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp' I1104 03:37:55.523553 28363 leveldb.cpp:176] Opened db in 2.270387ms I1104 03:37:55.524250 28363 leveldb.cpp:183] Compacted db in 662527ns I1104 03:37:55.524276 28363 leveldb.cpp:198] Created db iterator in 4964ns I1104 03:37:55.524284 28363 leveldb.cpp:204] Seeked to beginning of db in 702ns I1104 03:37:55.524291 28363 leveldb.cpp:273] Iterated through 0 keys in the db in 450ns I1104 03:37:55.524333 28363 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I1104 03:37:55.524852 28384 recover.cpp:437] Starting replica recovery I1104 03:37:55.525188 28384 recover.cpp:463] Replica is in EMPTY status I1104 03:37:55.526577 28378 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I1104 03:37:55.527135 28378 master.cpp:318] Master 20141104-033755-3176252227-49988-28363 (proserpina.apache.org) started on 67.195.81.189:49988 I1104 03:37:55.527180 28378 master.cpp:364] Master only allowing authenticated frameworks to register I1104 03:37:55.527191 28378 master.cpp:369] Master only allowing authenticated slaves to register I1104 03:37:55.527217 28378 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp/credentials' I1104 03:37:55.527451 28378 master.cpp:408] Authorization enabled I1104 03:37:55.528081 28384 master.cpp:126] No whitelist given. Advertising offers for all slaves I1104 03:37:55.528548 28383 recover.cpp:188] Received a recover response from a replica in EMPTY status I1104 03:37:55.528645 28388 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.189:49988 I1104 03:37:55.529233 28388 master.cpp:1258] The newly elected leader is master@67.195.81.189:49988 with id 20141104-033755-3176252227-49988-28363 I1104 03:37:55.529266 28388 master.cpp:1271] Elected as the leading master! 
I1104 03:37:55.529289 28388 master.cpp:1089] Recovering from registrar I1104 03:37:55.529311 28385 recover.cpp:554] Updating replica status to STARTING I1104 03:37:55.529500 28384 registrar.cpp:313] Recovering registrar I1104 03:37:55.530037 28383 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 497965ns I1104 03:37:55.530083 28383 replica.cpp:320] Persisted replica status to STARTING I1104 03:37:55.530335 28387 recover.cpp:463] Replica is in STARTING status I1104 03:37:55.531343 28381 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I1104 03:37:55.531739 28384 recover.cpp:188] Received a recover response from a replica in STARTING status I1104 03:37:55.532168 28379 recover.cpp:554] Updating replica status to VOTING I1104 03:37:55.532572 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 293974ns I1104 03:37:55.532594 28381 replica.cpp:320] Persisted replica status to VOTING I1104 03:37:55.532790 28390 recover.cpp:568] Successfully joined the Paxos group I1104 03:37:55.533107 28390 recover.cpp:452] Recover process terminated I1104 03:37:55.533604 28382 log.cpp:656] Attempting to start the writer I1104 03:37:55.534840 28381 replica.cpp:474] Replica received implicit promise request with proposal 1 I1104 03:37:55.535188 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 321021ns I1104 03:37:55.535212 28381 replica.cpp:342] Persisted promised to 1 I1104 03:37:55.535893 28378 coordinator.cpp:230] Coordinator attemping to fill missing position I1104 03:37:55.537318 28392 replica.cpp:375] Replica received explicit promise
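A standalone sketch of the failure mode and the obvious repair (the plugin's real state is only assumed to look roughly like this): load() and lookup() race on unsynchronized static data, and serializing access with a mutex, or loading the secrets once centrally in the Master, removes the race.
{code}
#include <map>
#include <mutex>
#include <string>

static std::mutex mutex;
static std::map<std::string, std::string> properties;  // shared static state

void load(const std::map<std::string, std::string>& secrets)
{
  std::lock_guard<std::mutex> lock(mutex);  // without this: data race / SEGV
  properties = secrets;
}

std::string lookup(const std::string& user)
{
  std::lock_guard<std::mutex> lock(mutex);
  auto it = properties.find(user);
  return it == properties.end() ? "" : it->second;
}
{code}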
[jira] [Updated] (MESOS-2085) Add support for encrypted and non-encrypted communication in parallel for cluster upgrade
[ https://issues.apache.org/jira/browse/MESOS-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2085: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add support for encrypted and non-encrypted communication in parallel for cluster upgrade - Key: MESOS-2085 URL: https://issues.apache.org/jira/browse/MESOS-2085 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere During a cluster upgrade from non-encrypted to encrypted communication, we need to support an interim state where: 1) A master can have connections to both encrypted and non-encrypted slaves. 2) A slave that supports encrypted communication connects to a master that has not yet been upgraded. 3) Frameworks are encrypted but the master has not been upgraded yet. 4) Master has been upgraded but frameworks haven't. 5) A slave process has upgraded but running executor processes haven't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353839#comment-14353839 ] Joerg Schad edited comment on MESOS-2309 at 3/9/15 11:10 PM: - Just tested and you are correct :-). was (Author: js84): Just tested and you are right :-). Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
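The underlying protobuf behavior can be demonstrated in isolation (include path assumed; CommandInfo's shell field defaults to true): the two messages have the same effective value but differ in field presence and in serialized bytes, which is what presence- or byte-sensitive comparisons trip over.
{code}
#include <cassert>

#include <mesos/mesos.pb.h>  // assumed include path

int main()
{
  mesos::CommandInfo a;  // shell unset: a.shell() == true via the default
  mesos::CommandInfo b;
  b.set_shell(true);     // shell explicitly set to its default value

  assert(a.shell() == b.shell());                          // same effective value
  assert(a.has_shell() != b.has_shell());                  // different presence
  assert(a.SerializeAsString() != b.SerializeAsString());  // different bytes
  return 0;
}
{code}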
[jira] [Updated] (MESOS-2333) Securing Sandboxes via Filebrowser Access Control
[ https://issues.apache.org/jira/browse/MESOS-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2333: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Securing Sandboxes via Filebrowser Access Control - Key: MESOS-2333 URL: https://issues.apache.org/jira/browse/MESOS-2333 Project: Mesos Issue Type: Improvement Components: security Reporter: Adam B Assignee: Alexander Rojas Labels: authorization, filebrowser, mesosphere, security As it stands now, anybody with access to the master or slave web UI can use the filebrowser to view the contents of any attached/mounted paths on the master or slave. Currently, the attached paths include master and slave logs as well as executor/task sandboxes. While there's a chance that the master and slave logs could contain sensitive information, it's much more likely that sandboxes could contain customer data or other files that should not be globally accessible. Securing the sandboxes is the primary goal of this ticket. There are four filebrowser endpoints: browse, read, download, and debug. Here are some potential solutions. 1) We could easily provide flags that globally enable/disable each endpoint, allowing coarse-grained access control. This might be a reasonable short-term plan. We would also want to update the web UIs to display an Access Denied error, rather than showing links that open up blank pailers. 2) Each master and slave handles its own authn/authz. Slaves will need to have an authenticator, and there must be a way to provide each node with credentials and ACLs, and keep these in sync across the cluster. 3) Filter all slave communications through the master(s), which already has credentials and ACLs. We'll have to restrict access to the filebrowser (and other?) endpoints to the (leading?) master. Then the master can perform the authentication and authorization, only passing the request on to the slave if auth succeeds. 3a) The slave returns the browse/read/download response back through the master. This could be a network bottleneck. 3b) Upon authn/z success, the master redirects the request to the appropriate slave, which will send the response directly back to the requester. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
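Option (1) could be as small as a few stout-style flags. A sketch under the assumption that the flag names and the FlagsBase add() pattern shown here apply (they are illustrative, not the actual Mesos flags):
{code}
#include <stout/flags.hpp>  // assumed include path

class Flags : public flags::FlagsBase
{
public:
  Flags()
  {
    add(&Flags::enable_files_browse,
        "enable_files_browse",
        "Whether the /files/browse endpoint is served",
        true);

    add(&Flags::enable_files_download,
        "enable_files_download",
        "Whether the /files/download endpoint is served",
        true);
  }

  bool enable_files_browse;
  bool enable_files_download;
};
{code}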
[jira] [Updated] (MESOS-2139) Enable master's Accept call handler to support Dynamic Reservation
[ https://issues.apache.org/jira/browse/MESOS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2139: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Enable master's Accept call handler to support Dynamic Reservation --- Key: MESOS-2139 URL: https://issues.apache.org/jira/browse/MESOS-2139 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: mesosphere The allocated resources in the allocator need to be updated when a dynamic reservation is performed because we need to transition the {{Resources}} that are marked {{reservationType=STATIC}} to {{DYNAMIC}}. {{Resources::apply(Offer::Operation)}} is used to determine the resulting set of resources after an operation. This is to be used to update the resources in places such as the allocator and the total slave resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
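For illustration, the update step could look like this (variable names are illustrative and the includes are assumed; the flatten/apply calls mirror the snippets in MESOS-2373 above):
{code}
#include <string>

#include <mesos/mesos.pb.h>      // assumed include path
#include <mesos/resources.hpp>   // assumed include path

// Compute the post-operation resources and, on success, swap them into the
// bookkeeping (e.g. the allocator's view of the slave's total resources).
void applyReserve(Resources& total, const Resources& unreserved, const std::string& role)
{
  Resources reserved = unreserved.flatten(role, Resource::FRAMEWORK);

  Offer::Operation reserve;
  reserve.set_type(Offer::Operation::RESERVE);
  reserve.mutable_reserve()->mutable_resources()->CopyFrom(reserved);

  Try<Resources> updated = total.apply(reserve);
  if (updated.isSome()) {
    total = updated.get();
  }
}
{code}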
[jira] [Updated] (MESOS-2229) Add max allowed age to Slave stats.json endpoint
[ https://issues.apache.org/jira/browse/MESOS-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2229: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add max allowed age to Slave stats.json endpoint -- Key: MESOS-2229 URL: https://issues.apache.org/jira/browse/MESOS-2229 Project: Mesos Issue Type: Improvement Components: json api Reporter: Sunil Abraham Assignee: Alexander Rojas Labels: mesosphere Currently max allowed age gets logged, but it would be great to have this in the slave's stats.json endpoint for programmatic access. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2161) AbstractState JNI check fails for Marathon framework
[ https://issues.apache.org/jira/browse/MESOS-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Swartz updated MESOS-2161: Attachment: mesos_core_dump_gdb.txt.bz2 GDB investigation of the core dump from Marathon including registers. AbstractState JNI check fails for Marathon framework Key: MESOS-2161 URL: https://issues.apache.org/jira/browse/MESOS-2161 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Environment: Mesos 0.21.0 Marathon 0.7.5 Fedora 20 Reporter: Matthew Sanders Attachments: mesos_core_dump_gdb.txt.bz2 I've recently upgraded to mesos 0.21.0 and now it seems that every few minutes or so I see the following error, which kills marathon. Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,064] INFO 10.133.128.26 - - [26/Nov/2014:00:12:41 +] GET /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 (mesosphere.chaos.http.ChaosRequestLog:15) Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,238] INFO 10.133.128.26 - - [26/Nov/2014:00:12:42 +] GET /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 (mesosphere.chaos.http.ChaosRequestLog:15) Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,961] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +] GET /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537... Nov 25 18:12:43 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:43,032] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +] GET /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari... 
Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: F1125 18:12:44.146260 5897 check.hpp:79] Check failed: f.isReady() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: *** Check failure stack trace: *** Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b17c google::LogMessage::Fail() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b0d5 google::LogMessage::SendToLog() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2aab3 google::LogMessage::Flush() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2da3b google::LogMessageFatal::~LogMessageFatal() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1ea64 _checkReady() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1d43b Java_org_apache_mesos_state_AbstractState__1_1names_1get Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f81f644ca70 (unknown) Nov 25 18:12:44 gianthornet.trading.imc.intra systemd[1]: marathon.service: main process exited, code=killed, status=6/ABRT Here's the command that mesos-master is being run with /usr/local/sbin/mesos-master --zk=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --port=5050 --log_dir=/var/log/mesos --quorum=1 --work_dir=/var/lib/mesos Here's the command that the slave is running with: /usr/local/sbin/mesos-slave --master=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --attributes=country:us;datacenter:njl3;environment:dev;region:amer;timezone:America/Chicago I realize this could also be filed to marathon, but it sort of looks like a c++ issue to me, which is why I came here to post this. Any help would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2309: -- Assignee: Vinod Kone (was: Joerg Schad) Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Vinod Kone Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1921) Design and implement protobuf storage of IP addresses
[ https://issues.apache.org/jira/browse/MESOS-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353926#comment-14353926 ] Evelina Dumitrescu commented on MESOS-1921: --- I think that a better approach would be to: - Design an additional protobuf message inside the MasterInfo message message IPv6 { required uint32 s6_addr1 = 1; required uint32 s6_addr2 = 2; required uint32 s6_addr3 = 3; required uint32 s6_addr4 = 4; } - Add an optional field of IPv6 type inside the MasterInfo - An additional field for the IPv6 hostname will be needed in the SlaveInfo, Offer, ContainerInfo protobuf messages. The resolved hostnames from the IPv4 and IPv6 addresses might differ. (Moreover, if the hostname cannot be resolved then a string version of the IP address will be returned.) Design and implement protobuf storage of IP addresses - Key: MESOS-1921 URL: https://issues.apache.org/jira/browse/MESOS-1921 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Evelina Dumitrescu We can use the {{bytes}} type or statements like {{repeated uint32 data = 4 [packed=true];}} {{string}} representations might again add some parsing overhead. An additional field might be necessary to specify the protocol family type (to distinguish between IPv4/IPv6). For example, if we don't specify the family type we can't distinguish between these IP addresses in the case of byte/array representation: 0:0:0:0:0:0:IPV4 and IPv4 (see http://tools.ietf.org/html/rfc4291#page-10) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2470) Create state abstraction stress test
Niklas Quarfot Nielsen created MESOS-2470: - Summary: Create state abstraction stress test Key: MESOS-2470 URL: https://issues.apache.org/jira/browse/MESOS-2470 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Due to https://issues.apache.org/jira/browse/MESOS-2161, we need a way to stress test the state abstraction and show its scalability properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353874#comment-14353874 ] Michael Park commented on MESOS-2467: - I'm surprised we only support {{JSON::Object}} currently, but I guess we haven't needed the other ones. It looks like the other flags have the pattern of: "See the X Protobuf in mesos.proto for the expected format." It seems that in order to keep consistency we would need a {{Resources}} message: {code}message Resources { repeated Resource resources = 1; }{code} which of course would break badly since we already have a {{Resources}} C++ type. Unless there's a sane way to separate those, it totally makes sense to support a JSON array of Resource objects instead, and we can say something like: "See the Resource protobuf in mesos.proto for the expected format of each of the elements." Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more stuff (e.g., persistence, reservation) into the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character: if it is '[', then we invoke the JSON parser; otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
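A sketch of the dispatch proposed in the description (JSON::parse and Resources::parse are existing stout/Mesos calls; the JSON-to-Resources conversion helper and the include paths are assumptions): scan the first character and route to the JSON parser only for '['.
{code}
#include <string>

#include <stout/error.hpp>
#include <stout/json.hpp>
#include <stout/try.hpp>

#include <mesos/resources.hpp>  // assumed include path

using mesos::Resources;

// Hypothetical conversion helper; each array element is expected to follow
// the Resource protobuf in mesos.proto.
Try<Resources> fromJSON(const JSON::Array& array);

Try<Resources> parseResourcesFlag(const std::string& value)
{
  if (!value.empty() && value[0] == '[') {
    Try<JSON::Array> json = JSON::parse<JSON::Array>(value);
    if (json.isError()) {
      return Error("Invalid JSON: " + json.error());
    }
    return fromJSON(json.get());
  }
  return Resources::parse(value, "*");  // existing format, e.g. "cpus:2;mem:512"
}
{code}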
[jira] [Commented] (MESOS-2469) Mesos master/slave should be able to bind to 127.0.0.1 if explicitly requested
[ https://issues.apache.org/jira/browse/MESOS-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353894#comment-14353894 ] Vinod Kone commented on MESOS-2469: --- https://reviews.apache.org/r/31872/ Mesos master/slave should be able to bind to 127.0.0.1 if explicitly requested -- Key: MESOS-2469 URL: https://issues.apache.org/jira/browse/MESOS-2469 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Vinod Kone With the current refactoring to IP it looks like master and slave can no longer bind to 127.0.0.1 even if explicitly requested via --ip flag. Among other things, this breaks the balloon framework test which uses this flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2427) Add Java binding for the acceptOffers API.
[ https://issues.apache.org/jira/browse/MESOS-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353934#comment-14353934 ] Jie Yu commented on MESOS-2427: --- https://reviews.apache.org/r/31873/ Add Java binding for the acceptOffers API. -- Key: MESOS-2427 URL: https://issues.apache.org/jira/browse/MESOS-2427 Project: Mesos Issue Type: Task Components: java api Reporter: Jie Yu Assignee: Jie Yu We introduced the new acceptOffers API in the C++ driver. We need to provide a Java binding for this API as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2427) Add Java binding for the acceptOffers API.
[ https://issues.apache.org/jira/browse/MESOS-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-2427: - Assignee: Jie Yu Add Java binding for the acceptOffers API. -- Key: MESOS-2427 URL: https://issues.apache.org/jira/browse/MESOS-2427 Project: Mesos Issue Type: Task Components: java api Reporter: Jie Yu Assignee: Jie Yu We introduced the new acceptOffers API in the C++ driver. We need to provide a Java binding for this API as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-1921) Design and implement protobuf storage of IP addresses
[ https://issues.apache.org/jira/browse/MESOS-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353926#comment-14353926 ] Evelina Dumitrescu edited comment on MESOS-1921 at 3/10/15 12:06 AM: - I think that a better approach would be to: - Design an additional protobuf message inside the MasterInfo message {noformat} message IPv6 { required uint32 s6_addr1 = 1; required uint32 s6_addr2 = 2; required uint32 s6_addr3 = 3; required uint32 s6_addr4 = 4; } {noformat} - Add an optional field of IPv6 type inside the MasterInfo - An additional field for the IPv6 hostname will be needed in the SlaveInfo, Offer, ContainerInfo protobuf messages. The resolved hostnames from the IPv4 and IPv6 addresses might differ. (Moreover, if the hostname cannot be resolved then a string version of the IP address will be returned.) was (Author: evelinad): I think that a better approach would be to: - Design an additional protobuf message inside the MasterInfo message message IPv6 { required uint32 s6_addr1 = 1; required uint32 s6_addr2 = 2; required uint32 s6_addr3 = 3; required uint32 s6_addr4 = 4; } - Add an optional field of IPv6 type inside the MasterInfo - An additional field for the IPv6 hostname will be needed in the SlaveInfo, Offer, ContainerInfo protobuf messages. The resolved hostnames from the IPv4 and IPv6 addresses might differ. (Moreover, if the hostname cannot be resolved then a string version of the IP address will be returned.) Design and implement protobuf storage of IP addresses - Key: MESOS-1921 URL: https://issues.apache.org/jira/browse/MESOS-1921 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Evelina Dumitrescu We can use the {{bytes}} type or statements like {{repeated uint32 data = 4 [packed=true];}} {{string}} representations might again add some parsing overhead. An additional field might be necessary to specify the protocol family type (to distinguish between IPv4/IPv6). For example, if we don't specify the family type we can't distinguish between these IP addresses in the case of byte/array representation: 0:0:0:0:0:0:IPV4 and IPv4 (see http://tools.ietf.org/html/rfc4291#page-10) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
Alexander Rojas created MESOS-2466: -- Summary: Write documentation for all the LIBPROCESS_* environment variables. Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should live. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching in the code, these are the variables which need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2451) mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event
[ https://issues.apache.org/jira/browse/MESOS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] craig bordelon updated MESOS-2451: -- Attachment: bug0.cpp mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event - Key: MESOS-2451 URL: https://issues.apache.org/jira/browse/MESOS-2451 Project: Mesos Issue Type: Bug Components: c++ api Affects Versions: 0.22.0 Environment: red hat linux 6.5 Reporter: craig bordelon Assignee: Benjamin Hindman Attachments: Makefile, bug.cpp, bug0.cpp, log.h We've observed that the mesos 0.22.0-rc1 c++ zookeeper code appears to hang (two threads stuck in indefinite pthread condition waits) on a test case that, as best we can tell, is a mesos issue and not an issue with the underlying apache zookeeper C binding. (That is, we tried the same type of case using the apache zookeeper C binding directly and saw no issues.) This happens with a properly running zookeeper (standalone is sufficient). Here's how we hung it: We issue a mesos zk set via int ZooKeeper::set(const std::string& path, const std::string& data, int version) then inside a Watcher we process the CHANGED event to issue a mesos zk get on the same path via int ZooKeeper::get(const std::string& path, bool watch, std::string* result, Stat* stat) and we end up with two threads in the process, both in pthread_cond_wait: #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab1aac in ZooKeeper::set (this=0x803ce0, path=/craig/mo, data= ... and #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6638000d00, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path=/craig/mo, watch=false, We of course have a separate enhancement suggestion that the mesos C++ zookeeper api use timed waits and not block indefinitely for responses. But in this case we think the mesos code itself is blocking on itself and not handling the responses. craig -- This message was sent by Atlassian JIRA (v6.3.4#6332)
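The reproducing pattern boils down to the following shape (a sketch only: the Watcher interface follows the signatures quoted above, and the wiring is illustrative). A blocking get() issued from inside the watcher callback waits on machinery that the callback itself is holding up.
{code}
#include <string>

// Sketch: issuing a blocking ZooKeeper::get() from within the watcher that
// fires on the CHANGED event for the same node.
class ChangedWatcher : public Watcher
{
public:
  ChangedWatcher() : zk(NULL) {}

  virtual void process(int type, int state, int64_t sessionId, const std::string& path)
  {
    if (type == ZOO_CHANGED_EVENT) {
      std::string result;
      zk->get(path, false, &result, NULL);  // blocks: both threads end up in pthread_cond_wait
    }
  }

  ZooKeeper* zk;  // assigned after the ZooKeeper object is constructed
};
{code}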
[jira] [Commented] (MESOS-2451) mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event
[ https://issues.apache.org/jira/browse/MESOS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354009#comment-14354009 ] craig bordelon commented on MESOS-2451: --- Oh oh. This is not looking like a mesos c++ zookeeper api issue after all. I now attach the test case rewritten to use just the zk c binding api; it hangs just the same. So it looks like I was wrong, when I created this case, to claim that the underlying zookeeper was fine with this type of processing. A question for anybody more familiar with the apache zookeeper c binding api: just why can't a watcher handler on CHANGED perform an asynchronous aget() on the same node? mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event - Key: MESOS-2451 URL: https://issues.apache.org/jira/browse/MESOS-2451 Project: Mesos Issue Type: Bug Components: c++ api Affects Versions: 0.22.0 Environment: red hat linux 6.5 Reporter: craig bordelon Assignee: Benjamin Hindman Attachments: Makefile, bug.cpp, log.h We've observed that the mesos 0.22.0-rc1 c++ zookeeper code appears to hang (two threads stuck in indefinite pthread condition waits) on a test case that, as best we can tell, is a mesos issue and not an issue with the underlying apache zookeeper C binding. (That is, we tried the same type of case using the apache zookeeper C binding directly and saw no issues.) This happens with a properly running zookeeper (standalone is sufficient). Here's how we hung it: We issue a mesos zk set via int ZooKeeper::set(const std::string& path, const std::string& data, int version) then inside a Watcher we process the CHANGED event to issue a mesos zk get on the same path via int ZooKeeper::get(const std::string& path, bool watch, std::string* result, Stat* stat) and we end up with two threads in the process, both in pthread_cond_wait: #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab1aac in ZooKeeper::set (this=0x803ce0, path=/craig/mo, data= ... and #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6638000d00, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path=/craig/mo, watch=false, We of course have a separate enhancement suggestion that the mesos C++ zookeeper api use timed waits and not block indefinitely for responses. But in this case we think the mesos code itself is blocking on itself and not handling the responses. craig -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2205) Add user documentation for Dynamic Reservation
[ https://issues.apache.org/jira/browse/MESOS-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2205: Sprint: Mesosphere Q1 Sprint 5 - 3/20 Add user documentation for Dynamic Reservation -- Key: MESOS-2205 URL: https://issues.apache.org/jira/browse/MESOS-2205 Project: Mesos Issue Type: Documentation Components: framework Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Add a document for dynamic reservations. Topics include motivation, use cases, correct usage, valid states and transitions of the (role, reservation_type) pair. -- This message was sent by Atlassian JIRA (v6.3.4#6332)