[jira] [Updated] (MESOS-2424) Components installed by Mesos 0.22.0-rc2 conflict with python-setuptools on CentOS 6
[ https://issues.apache.org/jira/browse/MESOS-2424?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alexander Rojas updated MESOS-2424: --- Assignee: (was: Alexander Rojas) Components installed by Mesos 0.22.0-rc2 conflict with python-setuptools on CentOS 6 Key: MESOS-2424 URL: https://issues.apache.org/jira/browse/MESOS-2424 Project: Mesos Issue Type: Bug Affects Versions: 0.22.0 Environment: Package Build Env: CentOS 6.5 Test Env: CentOS 6.5 + python_setuptools + mesos 0.22.0-rc2 package Reporter: Jeremy Lingmann
With Mesos 0.22.0-rc2 we are seeing files installed in /usr/lib/python2.6 which conflict with the upstream python-setuptools package on CentOS 6. This behavior is new with 0.22.0 and is something we were not seeing with the 0.21.1 builds for CentOS 6. Here is the failure with our pre-built package:
{code}
# yum install ./pkg.rpm
Loaded plugins: fastestmirror, presto
Loading mirror speeds from cached hostfile
 * base: mirrors.advancedhosters.com
 * extras: mirror.ash.fastserv.com
 * updates: mirrors.lga7.us.voxel.net
Setting up Install Process
Examining ./pkg.rpm: mesos-0.22.0-0.1.20150228011042.rc2.centos65.x86_64
Marking ./pkg.rpm to be installed
Resolving Dependencies
--> Running transaction check
---> Package mesos.x86_64 0:0.22.0-0.1.20150228011042.rc2.centos65 will be installed
--> Processing Dependency: subversion for package: mesos-0.22.0-0.1.20150228011042.rc2.centos65.x86_64
--> Running transaction check
---> Package subversion.x86_64 0:1.6.11-12.el6_6 will be installed
--> Processing Dependency: perl(URI) >= 1.17 for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: apr >= 1.3.0 for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: libneon.so.27()(64bit) for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: libaprutil-1.so.0()(64bit) for package: subversion-1.6.11-12.el6_6.x86_64
--> Processing Dependency: libapr-1.so.0()(64bit) for package: subversion-1.6.11-12.el6_6.x86_64
--> Running transaction check
---> Package apr.x86_64 0:1.3.9-5.el6_2 will be installed
---> Package apr-util.x86_64 0:1.3.9-3.el6_0.1 will be installed
---> Package neon.x86_64 0:0.29.3-3.el6_4 will be installed
--> Processing Dependency: libproxy.so.0()(64bit) for package: neon-0.29.3-3.el6_4.x86_64
--> Processing Dependency: libpakchois.so.0()(64bit) for package: neon-0.29.3-3.el6_4.x86_64
---> Package perl-URI.noarch 0:1.40-2.el6 will be installed
--> Running transaction check
---> Package libproxy.x86_64 0:0.3.0-10.el6 will be installed
--> Processing Dependency: libproxy-python = 0.3.0-10.el6 for package: libproxy-0.3.0-10.el6.x86_64
--> Processing Dependency: libproxy-bin = 0.3.0-10.el6 for package: libproxy-0.3.0-10.el6.x86_64
---> Package pakchois.x86_64 0:0.4-3.2.el6 will be installed
--> Running transaction check
---> Package libproxy-bin.x86_64 0:0.3.0-10.el6 will be installed
---> Package libproxy-python.x86_64 0:0.3.0-10.el6 will be installed
--> Finished Dependency Resolution

Dependencies Resolved

================================================================================
 Package          Arch    Version                                 Repository  Size
================================================================================
Installing:
 mesos            x86_64  0.22.0-0.1.20150228011042.rc2.centos65  /pkg        67 M
Installing for dependencies:
 apr              x86_64  1.3.9-5.el6_2                           base       123 k
 apr-util         x86_64  1.3.9-3.el6_0.1                         base        87 k
 libproxy         x86_64  0.3.0-10.el6                            base        39 k
 libproxy-bin     x86_64  0.3.0-10.el6                            base       9.0 k
 libproxy-python  x86_64  0.3.0-10.el6                            base       9.1 k
 neon             x86_64  0.29.3-3.el6_4                          base       119 k
 pakchois
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353265#comment-14353265 ] Michael Park commented on MESOS-2467: - +1 Dominic. Seems like a clean, simple, working solution to me. A question that I asked myself on this topic is: what format will the JSON take in terms of its fields? I think the answer is to use the same schema as the {{protobuf}} message, which simplifies the implementation since we can just use the {{protobuf::parse}} function in {{stout/protobuf.hpp}}. Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-94) Master and Slave HTTP handlers should have unit tests
[ https://issues.apache.org/jira/browse/MESOS-94?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-94: --- Labels: http json test twitter (was: http json test) Master and Slave HTTP handlers should have unit tests - Key: MESOS-94 URL: https://issues.apache.org/jira/browse/MESOS-94 Project: Mesos Issue Type: Improvement Components: json api, master, slave, test Reporter: Charles Reiss Labels: http, json, test, twitter The Master and Slave have HTTP handlers which serve their state (mainly for the webui to use). There should be unit tests of these. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2293) Implement the Call endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2293?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2293: - Labels: twitter (was: ) Implement the Call endpoint on master - Key: MESOS-2293 URL: https://issues.apache.org/jira/browse/MESOS-2293 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1988) Scheduler driver should not generate TASK_LOST when disconnected from master
[ https://issues.apache.org/jira/browse/MESOS-1988?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-1988: - Labels: twitter (was: ) Scheduler driver should not generate TASK_LOST when disconnected from master Key: MESOS-1988 URL: https://issues.apache.org/jira/browse/MESOS-1988 Project: Mesos Issue Type: Improvement Reporter: Vinod Kone Labels: twitter Currently, the driver replies to launchTasks() with TASK_LOST if it detects that it is disconnected from the master. After MESOS-1972 lands, this will be the only place where the driver generates TASK_LOST. See MESOS-1972 for more context. This fix is targeted for 0.22.0 to give frameworks time to implement reconciliation. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2419) Slave recovery not recovering tasks
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353337#comment-14353337 ] Niklas Quarfot Nielsen commented on MESOS-2419: --- Jörg, any updates on this ticket? Can you reproduce? If not, try to reach out to [~brenden] Slave recovery not recovering tasks --- Key: MESOS-2419 URL: https://issues.apache.org/jira/browse/MESOS-2419 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.22.0, 0.23.0 Reporter: Brenden Matthews Assignee: Joerg Schad Attachments: mesos-chronos.log.gz, mesos.log.gz In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job` which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`. Here's another case, which is for a docker task: {noformat} Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 
'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093 10023 slave.cpp:2637] Failed to update resources for container f2001064-e076-4978-b764-ed12a5244e78 of executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 running task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 on status update for terminal task, destroying container: Container
[jira] [Commented] (MESOS-2289) Design doc for the HTTP API
[ https://issues.apache.org/jira/browse/MESOS-2289?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353227#comment-14353227 ] Vinod Kone commented on MESOS-2289: --- https://docs.google.com/a/twitter.com/document/d/17EjlrEBEvSBllDC6Xu3BjDoKoGosZpJS0k78JRGx134/edit?usp=sharing Design doc for the HTTP API --- Key: MESOS-2289 URL: https://issues.apache.org/jira/browse/MESOS-2289 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Vinod Kone This tracks the design of the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353182#comment-14353182 ] Dominic Hamon commented on MESOS-2467: -- Instead of relying on the first character (which can also be '{' in valid JSON), perhaps we can instead:
- try JSON parsing, and catch failure
- fall back to the old parsing
This also means we can deprecate the old parsing behaviour more easily. Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
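A minimal sketch of the parse-then-fallback strategy suggested in the comment above. The function names and the {{Resources}} stand-in are hypothetical, not the actual Mesos API:
{code}
#include <string>

struct Resources {};  // Stand-in for mesos::Resources; illustration only.

// Illustration-only stubs: a real implementation would invoke an actual JSON
// parser and the existing --resources text parser, respectively.
static bool parseJson(const std::string& input, Resources* result)
{
  (void) input; (void) result;
  return false;  // Pretend the JSON parse failed, to exercise the fallback.
}

static bool parseLegacy(const std::string& input, Resources* result)
{
  (void) input; (void) result;
  return true;  // Pretend the legacy parse succeeded.
}

// The suggested strategy: attempt JSON first and treat failure as a signal
// to fall back, instead of sniffing the first character (which can
// legitimately be '{' as well as '[').
bool parseResources(const std::string& input, Resources* result)
{
  if (parseJson(input, result)) {
    return true;
  }
  return parseLegacy(input, result);
}
{code}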
[jira] [Resolved] (MESOS-2426) Developer Guide improvements
[ https://issues.apache.org/jira/browse/MESOS-2426?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone resolved MESOS-2426. --- Resolution: Fixed Fix Version/s: 0.23.0 commit a0a5f0ae710c88c7d9c5decc8bbbe6c30c2c9048 Author: Aaron Bell aaron.b...@gmail.com Date: Mon Mar 9 10:50:39 2015 -0700 MESOS-2426 developer guide improvements. 1. Add a new line for the user to run 'rbt status' to log into RB. 2. Change 'git co' (technically invalid) to 'git checkout'. Review: https://reviews.apache.org/r/31638 Developer Guide improvements Key: MESOS-2426 URL: https://issues.apache.org/jira/browse/MESOS-2426 Project: Mesos Issue Type: Bug Components: documentation Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor Fix For: 0.23.0 # The docs need to mention that `post-reviews.py` will not work until `rbt status` or similar has been used to log in. The post script actually hangs with no timeout. # `git co` is not a valid command. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2277) Document undocumented HTTP endpoints
[ https://issues.apache.org/jira/browse/MESOS-2277?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2277: - Labels: documentation newbie starter twitter (was: documentation newbie starter) Document undocumented HTTP endpoints Key: MESOS-2277 URL: https://issues.apache.org/jira/browse/MESOS-2277 Project: Mesos Issue Type: Improvement Reporter: Niklas Quarfot Nielsen Priority: Minor Labels: documentation, newbie, starter, twitter Did a quick scan and we are missing documentation for a few endpoints:
{code}
files/browse.json
files/read.json
files/download.json
files/debug.json
master/roles.json
master/state.json
master/stats.json
slave/state.json
slave/stats.json
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2294) Implement the Events endpoint on master
[ https://issues.apache.org/jira/browse/MESOS-2294?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon updated MESOS-2294: - Labels: twitter (was: ) Implement the Events endpoint on master --- Key: MESOS-2294 URL: https://issues.apache.org/jira/browse/MESOS-2294 Project: Mesos Issue Type: Task Reporter: Vinod Kone Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-1127) Implement the protobufs for the scheduler API
[ https://issues.apache.org/jira/browse/MESOS-1127?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dominic Hamon reassigned MESOS-1127: Assignee: Vinod Kone (was: Benjamin Hindman) Implement the protobufs for the scheduler API - Key: MESOS-1127 URL: https://issues.apache.org/jira/browse/MESOS-1127 Project: Mesos Issue Type: Task Components: framework Reporter: Benjamin Hindman Assignee: Vinod Kone Labels: twitter The default scheduler/executor interface and implementation in Mesos have a few drawbacks: (1) The interface is fairly high-level which makes it hard to do certain things, for example, handle events (callbacks) in batch. This can have a big impact on the performance of schedulers (for example, writing task updates that need to be persisted). (2) The implementation requires writing a lot of boilerplate JNI and native Python wrappers when adding additional API components. The plan is to provide a lower-level API that can easily be used to implement the higher-level API that is currently provided. This will also open the door to more easily building native-language Mesos libraries (i.e., not needing the C++ shim layer) and building new higher-level abstractions on top of the lower-level API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2467) Allow --resources flag to take JSON.
Jie Yu created MESOS-2467: - Summary: Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353736#comment-14353736 ] Vinod Kone commented on MESOS-2309: --- [~js84] I don't think it matters if a field has a default value or not. IIUC, every singular field has a default value. For message types, the default is a message with all fields unset (but with default values). Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
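For readers less familiar with proto2 semantics, a small self-contained illustration of the point above, assuming the generated {{mesos::CommandInfo}} class from mesos.proto (where {{shell}} is declared optional with default true):
{code}
#include <mesos/mesos.pb.h>  // Generated proto2 code; assumed available.

#include <cassert>

int main()
{
  mesos::CommandInfo command;

  // An unset optional field still reads as its declared default...
  assert(command.shell() == true);
  // ...but presence is tracked separately from the value:
  assert(!command.has_shell());

  // Setting the field to its default changes presence, not the value.
  command.set_shell(true);
  assert(command.has_shell());
  assert(command.shell() == true);

  return 0;
}
{code}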
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353669#comment-14353669 ] Joerg Schad commented on MESOS-2309: Do you mean optional fields with defaults (as in the case of shell)? For optional fields without defaults I believe we should first check via has_field whether it is actually present. Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353640#comment-14353640 ] Vinod Kone commented on MESOS-2309: --- In our code base, we are almost always interested in functional equivalence and not object equivalence. As such, I think we should fix our == operator overloads for all our protobufs to do equivalence checks of optional fields the same way as we do required fields. I'll prep a review for this shortly. Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
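A sketch of what such a functional-equivalence comparison could look like for one field. This is illustrative only, not the actual Mesos overload:
{code}
#include <mesos/mesos.pb.h>

// Compare optional fields by their effective values, which fall back to the
// declared defaults when unset. Under this check, a CommandInfo with 'shell'
// unset equals one with 'shell' explicitly set to true, unlike a
// presence-sensitive check (left.has_shell() == right.has_shell() && ...).
static bool equivalent(
    const mesos::CommandInfo& left,
    const mesos::CommandInfo& right)
{
  // A complete version would compare the remaining fields the same way.
  return left.value() == right.value() &&
         left.shell() == right.shell();
}
{code}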
[jira] [Closed] (MESOS-2414) Java bindings segfault during framework shutdown
[ https://issues.apache.org/jira/browse/MESOS-2414?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen closed MESOS-2414. - Resolution: Fixed commit e1bae61258bc80bd6006bbb6d4a4fb5c0cc95820 Author: Niklas Nielsen n...@qni.dk Date: Fri Mar 6 17:14:17 2015 -0800 Fixed race in getFieldID helper. The getFieldID helper looks up the java/lang/NoSuchFieldError class and stores it in a static. This turned out to provoke racy behavior with Java 8 when multiple drivers are created (and the class object may have been created by another thread). This patch reverts the 'static' optimization and looks up the class object when exceptions are thrown. Review: https://reviews.apache.org/r/31818 Java bindings segfault during framework shutdown Key: MESOS-2414 URL: https://issues.apache.org/jira/browse/MESOS-2414 Project: Mesos Issue Type: Bug Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen
{code}
I0226 16:39:59.063369 626044928 sched.cpp:831] Stopping framework '20150220-141149-16777343-5050-45194-'
[2015-02-26 16:39:59,063] INFO Driver future completed. Executing optional abdication command. (mesosphere.marathon.MarathonSchedulerService:191)
[2015-02-26 16:39:59,065] INFO Setting framework ID to 20150220-141149-16777343-5050-45194- (mesosphere.marathon.MarathonSchedulerService:75)
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x000106a266d0, pid=99408, tid=44291
#
# JRE version: Java(TM) SE Runtime Environment (8.0_25-b17) (build 1.8.0_25-b17)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.25-b02 mixed mode bsd-amd64 compressed oops)
# Problematic frame:
# V [libjvm.dylib+0x4266d0] Klass::is_subtype_of(Klass*) const+0x4
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try ulimit -c unlimited before starting Java again
#
# An error report file with more information is saved as:
# /Users/corpsi/projects/marathon/hs_err_pid99408.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.sun.com/bugreport/crash.jsp
#
Abort trap: 6
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
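The shape of the fix, reconstructed from the commit description above; this is a sketch of the pattern, not the verbatim patch:
{code}
#include <jni.h>

// Race-free variant of the helper: no function-local static jclass that
// multiple driver threads could initialize concurrently. The error class is
// looked up only on the (rare) failure path, at the point of the throw.
jfieldID getFieldID(
    JNIEnv* env,
    jclass clazz,
    const char* name,
    const char* signature)
{
  jfieldID id = env->GetFieldID(clazz, name, signature);
  if (id == NULL) {
    jclass error = env->FindClass("java/lang/NoSuchFieldError");
    if (error != NULL) {
      env->ThrowNew(error, name);
    }
  }
  return id;
}
{code}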
[jira] [Commented] (MESOS-2448) release checklist should include 'update homebrew' for OS X developers
[ https://issues.apache.org/jira/browse/MESOS-2448?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353689#comment-14353689 ] Niklas Quarfot Nielsen commented on MESOS-2448: --- Alright - will do. release checklist should include 'update homebrew' for OS X developers -- Key: MESOS-2448 URL: https://issues.apache.org/jira/browse/MESOS-2448 Project: Mesos Issue Type: Documentation Components: documentation Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor For many developers the {{brew install mesos}} path is the first exposure to Mesos. Currently this is maintained on a best-effort basis by the community, and it is not keeping up. Example: current Mesos compatibility is stuck in this PR: https://github.com/Homebrew/homebrew/pull/37087/ By adding this to the release checklist we can ensure developers can get the latest version easily. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2468) Update homebrew should be marked as an optional step in the release manager process doc
Niklas Quarfot Nielsen created MESOS-2468: - Summary: Update homebrew should be marked as an optional step in the release manager process doc Key: MESOS-2468 URL: https://issues.apache.org/jira/browse/MESOS-2468 Project: Mesos Issue Type: Documentation Reporter: Niklas Quarfot Nielsen Priority: Trivial The homebrew step should be marked as optional, as it's not owned by the Mesos project. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1023) Replace all static/global variables with non-POD type
[ https://issues.apache.org/jira/browse/MESOS-1023?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353464#comment-14353464 ] Dominic Hamon commented on MESOS-1023: -- commit f780f67717fe0aa25b6870baedd55c43a7017edb (HEAD, origin/master, origin/HEAD, master) Author: Dominic Hamon d...@twitter.com Commit: Dominic Hamon d...@twitter.com Remove static strings from process and split out some source. Review: https://reviews.apache.org/r/30841 Replace all static/global variables with non-POD type - Key: MESOS-1023 URL: https://issues.apache.org/jira/browse/MESOS-1023 Project: Mesos Issue Type: Bug Components: general, technical debt Reporter: Dominic Hamon Assignee: Dominic Hamon Labels: c++ See http://google-styleguide.googlecode.com/svn/trunk/cppguide.xml#Static_and_Global_Variables for the background. Real bugs have been seen. For example, in process::ID::generate we have a map<string, int> that can be accessed within the function after exit has been called. I.e., we can try to access the map after it's been destroyed, but before exit has completed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
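One standard remedy for this class of bug, per the style guide linked in the description: keep the non-POD static behind a pointer that is intentionally never deleted, so exit-time destruction cannot invalidate it. A sketch (synchronization omitted; the real process::ID::generate also locks, and this is not necessarily the exact fix Mesos applied):
{code}
#include <map>
#include <string>

std::string generate(const std::string& prefix)
{
  // Leaked on purpose: the map is never destroyed, so it stays valid even
  // when this function runs during exit(), after static destructors began.
  static std::map<std::string, int>* prefixes =
    new std::map<std::string, int>();

  int id = ++(*prefixes)[prefix];
  return prefix + "(" + std::to_string(id) + ")";
}
{code}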
[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353394#comment-14353394 ] Bernd Mathiske commented on MESOS-391: -- A simple approach would be to fix the problem right inside paths::createExecutorDirectory():
1. Check how many subdirs the executor parent dir already has.
2. If it is close to the limit, find out which of the existing subdirs are the oldest.
3. Delete one or more of the latter.
4. Now it should be safe to proceed with the mkdir().
We could then either immediately remove the deleted paths from the GC's internal bookkeeping, or we could just make sure that whenever their GC time is up it makes no difference if they are already gone (pre-deleted). Problems with this approach:
a) Recursively deleting a directory may take a while. This operation blocks the slave process's (actor's) progress. This is already a problem with the mkdir() itself, but that one is less likely to take long (although ultimately it might): it is one file operation. In contrast, the number of file system operations for a recursive deletion is in general unknown and could potentially be large.
b) If you are close to the limit and you only remove one subdir, then you may end up doing so again and again for many tasks.
I propose we deal with problem a) by handling the deletion on a different process than the slave process. The GC process is an obvious candidate. In the slave process, we can wait for a future that signals the completion of the deletion. (There are some concurrency issues that we will get to later.) The advantage of this approach is that it is watertight: there is then no way LINK_MAX can be exceeded by executor dirs. On the other hand, maybe speeding up GC for subdirs once a parent dir fills up beyond, say, 3/4 is always fast enough? But how do you know that for sure if the duration of a file deletion is in principle unknown? That said, the two approaches can of course be combined. I would start with the watertight one and add the other one if still so desired. Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. 
As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
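A sketch of the kind of guard discussed in the comment above, using st_nlink (a directory's link count grows with each subdirectory). The helper name and headroom value are hypothetical:
{code}
#include <sys/stat.h>
#include <unistd.h>

#include <string>

// Returns true when 'parent' is close enough to its filesystem's
// per-directory link limit that creating more executor subdirectories is
// risky and GC of the oldest run dirs should happen first.
bool nearLinkLimit(const std::string& parent, long headroom = 16)
{
  struct stat s;
  if (::stat(parent.c_str(), &s) != 0) {
    return false;  // Cannot tell; let the subsequent mkdir surface the error.
  }

  long max = ::pathconf(parent.c_str(), _PC_LINK_MAX);
  if (max <= 0) {
    return false;  // No limit reported for this filesystem.
  }

  return static_cast<long>(s.st_nlink) >= max - headroom;
}

// Hypothetical use inside createExecutorDirectory():
//   if (nearLinkLimit(executorsDir)) { /* GC the oldest run dirs first. */ }
{code}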
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353446#comment-14353446 ] Jie Yu commented on MESOS-2467: --- Should --resources be a JSON array (array of Resource) or a JSON object? I think it should be a JSON array. We would need to add support for parsing a JSON array into RepeatedProtobufXXX? Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more features (e.g., persistence, reservation) in the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character. If it is '[', then we invoke the JSON parser. Otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
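To make the array shape concrete, here is a sketch of parsing such a --resources JSON array. It uses rapidjson and a plain struct purely for illustration; Mesos itself would presumably route this through {{protobuf::parse}} in {{stout/protobuf.hpp}} once repeated-field support exists, and the flat "scalar" field shown here is a simplification of the real Resource layout:
{code}
#include <rapidjson/document.h>

#include <string>
#include <vector>

struct Resource  // Stand-in for the mesos::Resource protobuf.
{
  std::string name;
  double scalar;
};

bool parseResourcesJson(const std::string& input, std::vector<Resource>* out)
{
  rapidjson::Document doc;
  if (doc.Parse(input.c_str()).HasParseError() || !doc.IsArray()) {
    return false;  // Not valid JSON, or not the expected top-level array.
  }

  for (rapidjson::SizeType i = 0; i < doc.Size(); ++i) {
    const rapidjson::Value& element = doc[i];
    if (!element.IsObject() ||
        !element.HasMember("name") || !element["name"].IsString() ||
        !element.HasMember("scalar") || !element["scalar"].IsNumber()) {
      return false;  // Each element must look like one Resource.
    }

    Resource resource;
    resource.name = element["name"].GetString();
    resource.scalar = element["scalar"].GetDouble();
    out->push_back(resource);
  }

  return true;
}
{code}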
[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353480#comment-14353480 ] Ritwik Yadav commented on MESOS-391: That IMO is an excellent solution. However, what I fail to understand is the comparison between these two scenarios:
1) The slave process itself deleting a directory recursively.
2) The slave process waiting on the GC process, which deletes a directory recursively.
The latter might be a good design option, but does it affect run time in any way? Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2419) Slave recovery not recovering tasks
[ https://issues.apache.org/jira/browse/MESOS-2419?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353349#comment-14353349 ] Joerg Schad edited comment on MESOS-2419 at 3/9/15 6:33 PM: So far I can only reproduce this on the cluster. Currently reproducing the environment of the cluster (especially cgroups setup), already in contact with Brenden. was (Author: js84): So far I can only reproduce this on the cluster. Currently reproducing the environment of the clustert (especially cgroups setup), already in contact with Brenden. Slave recovery not recovering tasks --- Key: MESOS-2419 URL: https://issues.apache.org/jira/browse/MESOS-2419 Project: Mesos Issue Type: Bug Components: slave Affects Versions: 0.22.0, 0.23.0 Reporter: Brenden Matthews Assignee: Joerg Schad Attachments: mesos-chronos.log.gz, mesos.log.gz In a recent build from master (updated yesterday), slave recovery appears to have broken. I'll attach the slave log (with GLOG_v=1) showing a task called `long-running-job` which is a Chronos job that just does `sleep 1h`. After restarting the slave, the task shows as `TASK_FAILED`. Here's another case, which is for a docker task: {noformat} Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247159 10022 docker.cpp:421] Recovering Docker containers Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.247207 10022 docker.cpp:468] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254791 10022 docker.cpp:1333] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254812 10022 docker.cpp:1159] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.254844 10022 docker.cpp:1248] Running docker stop on container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262481 10027 containerizer.cpp:310] Recovering containerizer Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.262565 10027 containerizer.cpp:353] Recovering container 'f2001064-e076-4978-b764-ed12a5244e78' for executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework 20150226-230228-2931198986-5050-717- Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.263675 10027 linux_launcher.cpp:162] Couldn't find freezer cgroup for container f2001064-e076-4978-b764-ed12a5244e78, assuming already destroyed Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: W0227 00:09:49.265467 10020 cpushare.cpp:199] Couldn't find cgroup for container f2001064-e076-4978-b764-ed12a5244e78 Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266448 10022 containerizer.cpp:1147] Executor for container 'f2001064-e076-4978-b764-ed12a5244e78' has exited Feb 27 00:09:49 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:49.266466 10022 containerizer.cpp:938] Destroying container 'f2001064-e076-4978-b764-ed12a5244e78' Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.593585 10021 slave.cpp:3735] Sending reconnect request to executor 
chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- at executor(1)@10.81.189.232:43130 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597843 10024 slave.cpp:3175] Termination of executor 'chronos.55ffc971-be13-11e4-b8d6-566d21d75321' of framework '20150226-230228-2931198986-5050-717-' failed: Container 'f2001064-e076-4978-b764-ed12a5244e78' not found Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.597949 10025 slave.cpp:3429] Failed to unmonitor container for executor chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717-: Not monitored Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: I0227 00:09:50.598785 10024 slave.cpp:2508] Handling status update TASK_FAILED (UUID: d8afb771-a47a-4adc-b38b-c8cc016ab289) for task chronos.55ffc971-be13-11e4-b8d6-566d21d75321 of framework 20150226-230228-2931198986-5050-717- from @0.0.0.0:0 Feb 27 00:09:50 ip-10-81-189-232.ec2.internal mesos-slave[10018]: E0227 00:09:50.599093 10023 slave.cpp:2637] Failed to update
[jira] [Commented] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353481#comment-14353481 ] Ritwik Yadav commented on MESOS-391: That IMO is an excellent solution. However, what I fail to understand is the comparison between these two scenarios:
1) The slave process itself deleting a directory recursively.
2) The slave process waiting on the GC process, which deletes a directory recursively.
The latter might be a good design option, but does it affect run time in any way? Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Issue Comment Deleted] (MESOS-391) Slave GarbageCollector needs to also take into account the number of links, when determining removal time.
[ https://issues.apache.org/jira/browse/MESOS-391?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ritwik Yadav updated MESOS-391: --- Comment: was deleted (was: That IMO is an excellent solution. However, what I fail to understand is the comparison between these two scenarios: 1) The slave process itself deleting a directory recursively. 2) The slave process waiting on the GC process, which deletes a directory recursively. The latter might be a good design option, but does it affect run time in any way?) Slave GarbageCollector needs to also take into account the number of links, when determining removal time. -- Key: MESOS-391 URL: https://issues.apache.org/jira/browse/MESOS-391 Project: Mesos Issue Type: Bug Reporter: Benjamin Mahler Assignee: Bernd Mathiske Labels: twitter The slave garbage collector does not take into account the number of links present, which means that if we create a lot of executor directories (up to LINK_MAX), we won't necessarily GC. As a result of this, the slave crashes: F0313 21:40:02.926494 33746 paths.hpp:233] CHECK_SOME(mkdir) failed: Failed to create executor directory '/var/lib/mesos/slaves/201303090208-193162-5050-38880-267/frameworks/201103282247-19-/executors/thermos-1363210801777-mesos-meta_slave_0-27-e74e4b30-dcf1-4e88-8954-dd2b40b7dd89/runs/499fcc13-c391-421c-93d2-a56d1a4a931e': Too many links *** Check failure stack trace: *** @ 0x7f9320f82f9d google::LogMessage::Fail() @ 0x7f9320f88c07 google::LogMessage::SendToLog() @ 0x7f9320f8484c google::LogMessage::Flush() @ 0x7f9320f84ab6 google::LogMessageFatal::~LogMessageFatal() @ 0x7f9320c70312 _CheckSome::~_CheckSome() @ 0x7f9320c9dd5c mesos::internal::slave::paths::createExecutorDirectory() @ 0x7f9320c9e60d mesos::internal::slave::Framework::createExecutor() @ 0x7f9320c7a7f7 mesos::internal::slave::Slave::runTask() @ 0x7f9320c9cb43 ProtobufProcess::handler4() @ 0x7f9320c8678b std::tr1::_Function_handler::_M_invoke() @ 0x7f9320c9d1ab ProtobufProcess::visit() @ 0x7f9320e4c774 process::MessageEvent::visit() @ 0x7f9320e40a1d process::ProcessManager::resume() @ 0x7f9320e41268 process::schedule() @ 0x7f932055973d start_thread @ 0x7f931ef3df6d clone The fix here is to take into account the number of links (st_nlinks), when determining whether we need to GC. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353839#comment-14353839 ] Joerg Schad commented on MESOS-2309: Just tested and you are right :-). Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2108) Add configure flag or environment variable to enable SSL/libevent Socket
[ https://issues.apache.org/jira/browse/MESOS-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2108: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add configure flag or environment variable to enable SSL/libevent Socket Key: MESOS-2108 URL: https://issues.apache.org/jira/browse/MESOS-2108 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2110) Configurable Ping Timeouts
[ https://issues.apache.org/jira/browse/MESOS-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2110: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Configurable Ping Timeouts -- Key: MESOS-2110 URL: https://issues.apache.org/jira/browse/MESOS-2110 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Adam B Assignee: Adam B Labels: master, network, slave, timeout After a series of ping-failures, the master considers the slave lost and calls shutdownSlave, requiring such a slave that reconnects to kill its tasks and re-register as a new slaveId. On the other side, after a similar timeout, the slave will consider the master lost and try to detect a new master. These timeouts are currently hardcoded constants (5 * 15s), which may not be well-suited for all scenarios.
- Some clusters may tolerate a longer slave process restart period, and wouldn't want tasks to be killed upon reconnect.
- Some clusters may have higher-latency networks (e.g. cross-datacenter, or for volunteer computing efforts), and would like to tolerate longer periods without communication.
We should provide flags/mechanisms on the master to control its tolerance for non-communicative slaves, and (less importantly?) on the slave to tolerate missing masters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
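A sketch of how such tolerances might be exposed as master flags using stout's flags library. The flag names and defaults here are illustrative (mirroring the current hardcoded 5 attempts x 15s), not a committed interface:
{code}
#include <stout/duration.hpp>
#include <stout/flags.hpp>

class Flags : public virtual flags::FlagsBase
{
public:
  Flags()
  {
    add(&Flags::slave_ping_timeout,
        "slave_ping_timeout",
        "Duration to wait for a ping response from a slave before\n"
        "counting it as a failed ping attempt.",
        Seconds(15));  // Matches the current hardcoded interval.

    add(&Flags::max_slave_ping_timeouts,
        "max_slave_ping_timeouts",
        "Number of consecutive failed pings after which the master\n"
        "considers the slave lost.",
        5u);  // Matches the current hardcoded attempt count.
  }

  Duration slave_ping_timeout;
  size_t max_slave_ping_timeouts;
};
{code}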
[jira] [Updated] (MESOS-2351) Enable label and environment decorators (hooks) to remove label and environment entries
[ https://issues.apache.org/jira/browse/MESOS-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2351: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Enable label and environment decorators (hooks) to remove label and environment entries --- Key: MESOS-2351 URL: https://issues.apache.org/jira/browse/MESOS-2351 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen We need to change the semantics of decorators to be able to not only add labels and environment variables, but also remove them. The change is fairly small. The hook manager (and call site) uses CopyFrom instead of MergeFrom, and hook implementors pass on the labels and environment from the task and executor commands, respectively. In the future, we can tag labels such that only labels belonging to a hook type (across master and slave) can be inspected and changed. For now, the active hooks are selected by the operator and can therefore be trusted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
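The protobuf-level difference this hinges on, in a short sketch (assuming the generated {{mesos::Labels}} message; this is not the actual hook manager code):
{code}
#include <mesos/mesos.pb.h>

// MergeFrom() appends to repeated fields, so a decorator's output could only
// ever *add* labels. CopyFrom() replaces the target wholesale, so a decorator
// that returns a label set with entries omitted effectively removes them.
void applyDecoratorResult(
    const mesos::Labels& decorated,
    mesos::Labels* target)
{
  // Old semantics (removal impossible):
  //   target->MergeFrom(decorated);

  // New semantics: the decorator's returned labels are authoritative.
  target->CopyFrom(decorated);
}
{code}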
[jira] [Updated] (MESOS-2074) Fetcher cache test fixture
[ https://issues.apache.org/jira/browse/MESOS-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2074: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Fetcher cache test fixture -- Key: MESOS-2074 URL: https://issues.apache.org/jira/browse/MESOS-2074 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 72h Remaining Estimate: 72h To accelerate providing good test coverage for the fetcher cache (MESOS-336), we can provide a framework that canonicalizes creating and running a number of tasks and allows easy parametrization with combinations of the following:
- whether to cache or not
- whether to make what has been downloaded executable or not
- whether to extract from an archive or not
- whether to download from a file system, http, or...
We can create a simple HTTP server in the test fixture to support the latter. Furthermore, the tests need to be robust wrt. varying numbers of StatusUpdate messages. An accumulating update message sink that reports the final state is needed. All this has already been programmed in this patch; it just needs to be rebased: https://reviews.apache.org/r/21316/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
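A sketch of the parametrization idea using googletest's value-parameterized tests, assuming a gtest build with std::tuple support; the fixture body is a placeholder, not the patch under review:
{code}
#include <gtest/gtest.h>

#include <tuple>

// One fixture instantiated over all combinations of
// (cache?, make executable?, extract from archive?).
class FetcherCacheTest
  : public ::testing::TestWithParam<std::tuple<bool, bool, bool>> {};

TEST_P(FetcherCacheTest, RunsTask)
{
  bool cache, executable, extract;
  std::tie(cache, executable, extract) = GetParam();

  // A real test would configure a CommandInfo::URI from these knobs, run a
  // task, and assert on the accumulated status updates.
  SUCCEED();
}

INSTANTIATE_TEST_CASE_P(
    AllCombinations,
    FetcherCacheTest,
    ::testing::Combine(
        ::testing::Bool(),    // cache
        ::testing::Bool(),    // executable
        ::testing::Bool()));  // extract
{code}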
[jira] [Updated] (MESOS-1913) Create libevent/SSL-backed Socket implementation
[ https://issues.apache.org/jira/browse/MESOS-1913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1913: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Create libevent/SSL-backed Socket implementation Key: MESOS-1913 URL: https://issues.apache.org/jira/browse/MESOS-1913 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2155) Make docker containerizer killing orphan containers optional
[ https://issues.apache.org/jira/browse/MESOS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2155: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Make docker containerizer killing orphan containers optional Key: MESOS-2155 URL: https://issues.apache.org/jira/browse/MESOS-2155 Project: Mesos Issue Type: Improvement Components: docker Reporter: Timothy Chen Assignee: Timothy Chen Currently the docker containerizer, on recovery, will kill containers that it does not recognize. We want to make this behavior optional, as there are certain situations in which we want to let the docker containers continue to run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2226) HookTest.VerifySlaveLaunchExecutorHook is flaky
[ https://issues.apache.org/jira/browse/MESOS-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2226: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) HookTest.VerifySlaveLaunchExecutorHook is flaky --- Key: MESOS-2226 URL: https://issues.apache.org/jira/browse/MESOS-2226 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Vinod Kone Assignee: Kapil Arya Labels: flaky-test Observed this on internal CI {code} [ RUN ] HookTest.VerifySlaveLaunchExecutorHook Using temporary directory '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME' I0114 18:51:34.659353 4720 leveldb.cpp:176] Opened db in 1.255951ms I0114 18:51:34.662112 4720 leveldb.cpp:183] Compacted db in 596090ns I0114 18:51:34.662364 4720 leveldb.cpp:198] Created db iterator in 177877ns I0114 18:51:34.662719 4720 leveldb.cpp:204] Seeked to beginning of db in 19709ns I0114 18:51:34.663010 4720 leveldb.cpp:273] Iterated through 0 keys in the db in 18208ns I0114 18:51:34.663312 4720 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0114 18:51:34.664266 4735 recover.cpp:449] Starting replica recovery I0114 18:51:34.664908 4735 recover.cpp:475] Replica is in EMPTY status I0114 18:51:34.667842 4734 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0114 18:51:34.669117 4735 recover.cpp:195] Received a recover response from a replica in EMPTY status I0114 18:51:34.677913 4735 recover.cpp:566] Updating replica status to STARTING I0114 18:51:34.683157 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 137939ns I0114 18:51:34.683507 4735 replica.cpp:323] Persisted replica status to STARTING I0114 18:51:34.684013 4735 recover.cpp:475] Replica is in STARTING status I0114 18:51:34.685554 4738 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0114 18:51:34.696512 4736 recover.cpp:195] Received a recover response from a replica in STARTING status I0114 18:51:34.700552 4735 recover.cpp:566] Updating replica status to VOTING I0114 18:51:34.701128 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 115624ns I0114 18:51:34.701478 4735 replica.cpp:323] Persisted replica status to VOTING I0114 18:51:34.701817 4735 recover.cpp:580] Successfully joined the Paxos group I0114 18:51:34.702569 4735 recover.cpp:464] Recover process terminated I0114 18:51:34.716439 4736 master.cpp:262] Master 20150114-185134-2272962752-57018-4720 (fedora-19) started on 192.168.122.135:57018 I0114 18:51:34.716913 4736 master.cpp:308] Master only allowing authenticated frameworks to register I0114 18:51:34.717136 4736 master.cpp:313] Master only allowing authenticated slaves to register I0114 18:51:34.717488 4736 credentials.hpp:36] Loading credentials for authentication from '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME/credentials' I0114 18:51:34.718077 4736 master.cpp:357] Authorization enabled I0114 18:51:34.719238 4738 whitelist_watcher.cpp:65] No whitelist given I0114 18:51:34.719755 4737 hierarchical_allocator_process.hpp:285] Initialized hierarchical allocator process I0114 18:51:34.722584 4736 master.cpp:1219] The newly elected leader is master@192.168.122.135:57018 with id 20150114-185134-2272962752-57018-4720 I0114 18:51:34.722865 4736 master.cpp:1232] Elected as 
the leading master! I0114 18:51:34.723310 4736 master.cpp:1050] Recovering from registrar I0114 18:51:34.723760 4734 registrar.cpp:313] Recovering registrar I0114 18:51:34.725229 4740 log.cpp:660] Attempting to start the writer I0114 18:51:34.727893 4739 replica.cpp:477] Replica received implicit promise request with proposal 1 I0114 18:51:34.728425 4739 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 114781ns I0114 18:51:34.728662 4739 replica.cpp:345] Persisted promised to 1 I0114 18:51:34.731271 4741 coordinator.cpp:230] Coordinator attemping to fill missing position I0114 18:51:34.733223 4734 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0114 18:51:34.734076 4734 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 87441ns I0114 18:51:34.734441 4734 replica.cpp:679] Persisted action at 0 I0114 18:51:34.740272 4739 replica.cpp:511] Replica received write request for position 0 I0114 18:51:34.740910 4739 leveldb.cpp:438] Reading position from leveldb took 59846ns I0114 18:51:34.741672 4739
[jira] [Updated] (MESOS-2160) Add support for allocator modules
[ https://issues.apache.org/jira/browse/MESOS-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2160: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add support for allocator modules - Key: MESOS-2160 URL: https://issues.apache.org/jira/browse/MESOS-2160 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Alexander Rukletsov Labels: mesosphere Currently Mesos supports only the DRF allocator; changing it requires hacking the Mesos source code, which sets a high entry barrier. Allocator modules make it easy to tweak the resource allocation policy: they enable swapping allocation policies without editing the Mesos source code. Custom allocators can be written by anyone and do not need to be distributed together with Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
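A hypothetical sketch of what such a module could look like (the header paths, namespaces, and Module<> signature below are assumptions for illustration, not the final interface): the allocation policy lives in a loadable library rather than in the Mesos source tree.
{code}
// Sketch only: module API names are assumed for illustration.
#include <mesos/module.hpp>            // assumed header
#include <mesos/module/allocator.hpp>  // assumed header

static mesos::master::allocator::Allocator* createCustomAllocator(
    const mesos::Parameters& parameters)
{
  // Construct and return the custom policy here (in place of built-in DRF).
  return NULL;  // placeholder in this sketch
}

// Registering the module makes the policy loadable at master startup.
mesos::modules::Module<mesos::master::allocator::Allocator>
  org_example_CustomAllocator(
      MESOS_MODULE_API_VERSION,
      MESOS_VERSION,
      "Example Org",
      "dev@example.org",
      "Example allocator module",
      NULL,                    // compatibility callback
      createCustomAllocator);
{code}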
[jira] [Updated] (MESOS-2057) Concurrency control for fetcher cache
[ https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2057: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Concurrency control for fetcher cache - Key: MESOS-2057 URL: https://issues.apache.org/jira/browse/MESOS-2057 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 96h Remaining Estimate: 96h MESOS-2069 added a flag to CommandInfo URIs that indicates that files downloaded by the fetcher shall be cached in a repository. Now ensure that when a URI is cached, it is only ever downloaded once for the same user on the same slave as long as the slave keeps running. This holds even if multiple tasks request the same URI concurrently. If multiple requests for the same URI occur, perform only one of them and reuse the result. Make concurrent requests for the same URI wait for the one download. Different URIs from different CommandInfos can be downloaded concurrently. No cache eviction, cleanup or failover will be handled for now; additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Note that implementing this does not suffice for production use. This ticket contains the main part of the fetcher logic, though. See the epic MESOS-336 for the rest of the features that lead to a fully functional fetcher cache. The proposed general approach is to keep all bookkeeping about what is in which stage of being fetched and where it resides in the slave's MesosContainerizerProcess, so that all concurrent access is disambiguated and controlled by an actor (aka libprocess process). Depends on MESOS-2056 and MESOS-2069. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
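The waiting scheme described above is the classic single-flight pattern. A standalone sketch (not the actual Mesos code, which routes this through a libprocess actor; the download() helper here is hypothetical): the first request for a URI starts the download and publishes a shared future, and later requests for the same URI wait on that future instead of fetching again.
{code}
#include <future>
#include <map>
#include <mutex>
#include <string>

// Hypothetical helper standing in for the real fetch; returns the path of
// the downloaded file in the cache.
std::string download(const std::string& uri) { return "/cache/" + uri; }

std::mutex mutex;
std::map<std::string, std::shared_future<std::string>> downloads;  // URI -> path

std::string fetch(const std::string& uri)
{
  std::shared_future<std::string> future;
  {
    std::lock_guard<std::mutex> lock(mutex);
    auto it = downloads.find(uri);
    if (it == downloads.end()) {
      // First request for this URI: start the one and only download.
      it = downloads.emplace(
          uri,
          std::async(std::launch::async,
                     [uri]() { return download(uri); }).share()).first;
    }
    future = it->second;  // later requests share the same pending result
  }
  return future.get();  // all concurrent requesters wait on the one download
}
{code}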
[jira] [Updated] (MESOS-2157) Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints
[ https://issues.apache.org/jira/browse/MESOS-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2157: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints Key: MESOS-2157 URL: https://issues.apache.org/jira/browse/MESOS-2157 Project: Mesos Issue Type: Task Components: master Reporter: Niklas Quarfot Nielsen Assignee: Alexander Rojas Priority: Trivial Labels: mesosphere, newbie master/state.json exports the entire state of the cluster and can, for large clusters, become massive (tens of megabytes of JSON). Often, a client only needs information about subsets of the entire state, for example all connected slaves, or information (registration info, tasks, etc.) belonging to a particular framework. We can partition state.json into many smaller endpoints, but for starters, being able to get slave information and task information per framework would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2119) Add Socket tests
[ https://issues.apache.org/jira/browse/MESOS-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2119: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add Socket tests Key: MESOS-2119 URL: https://issues.apache.org/jira/browse/MESOS-2119 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere Add more Socket-specific tests to get coverage while doing the libev-to-libevent move (with and without SSL) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2373) DRFSorter needs to distinguish resources from different slaves.
[ https://issues.apache.org/jira/browse/MESOS-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2373: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) DRFSorter needs to distinguish resources from different slaves. --- Key: MESOS-2373 URL: https://issues.apache.org/jira/browse/MESOS-2373 Project: Mesos Issue Type: Bug Components: allocation Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Currently the {{DRFSorter}} aggregates total and allocated resources across multiple slaves, which only works for scalar resources. We need to distinguish resources from different slaves. Suppose we have 2 slaves and 1 framework. The framework is allocated all resources from both slaves. {code} Resources slaveResources = Resources::parse("cpus:2;mem:512;ports:[31000-32000]").get(); DRFSorter sorter; sorter.add(slaveResources); // Add slave1 resources sorter.add(slaveResources); // Add slave2 resources // Total resources in sorter at this point is // cpus(*):4; mem(*):1024; ports(*):[31000-32000]. // The scalar resources get aggregated correctly but ports do not. sorter.add("F"); // The 2 calls to allocated below only work because we simply do: // allocation[name] += resources; // without checking that the 'resources' is available in the total. sorter.allocated("F", slaveResources); sorter.allocated("F", slaveResources); // At this point, sorter.allocation("F") is: // cpus(*):4; mem(*):1024; ports(*):[31000-32000]. {code} To provide some context, this issue came up while trying to reserve all unreserved resources from every offer. {code} for (const Offer& offer : offers) { Resources unreserved = offer.resources().unreserved(); Resources reserved = unreserved.flatten(role, Resource::FRAMEWORK); Offer::Operation reserve; reserve.set_type(Offer::Operation::RESERVE); reserve.mutable_reserve()->mutable_resources()->CopyFrom(reserved); driver->acceptOffers({offer.id()}, {reserve}); } {code} Suppose the slave resources are the same as above: {quote} Slave1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} Slave2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} {quote} The initial (incorrect) total resources in the DRFSorter are: {quote} {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}} {quote} We receive 2 offers, 1 from each slave: {quote} Offer1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} Offer2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} {quote} At this point, the resources allocated for the framework are: {quote} {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}} {quote} After the first {{RESERVE}} operation with Offer1, the allocated resources for the framework become: {quote} {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; ports(role):\[31000-32000\]}} {quote} During the second {{RESERVE}} operation with Offer2: {code:title=HierarchicalAllocatorProcess::updateAllocation} // ... FrameworkSorter* frameworkSorter = frameworkSorters[frameworks[frameworkId].role]; Resources allocation = frameworkSorter->allocation(frameworkId.value()); // Update the allocated resources. Try<Resources> updatedAllocation = allocation.apply(operations); CHECK_SOME(updatedAllocation); // ... 
{code} {{allocation}} in the above code is: {quote} {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; ports(role):\[31000-32000\]}} {quote} We try to {{apply}} a {{RESERVE}} operation and we fail to find {{ports(\*):\[31000-32000\]}}, which leads to the {{CHECK}} failure at {{CHECK_SOME(updatedAllocation);}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
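The direction of a fix is to stop merging per-slave totals into one pool. A standalone sketch of that idea (types deliberately simplified; the real sorter would key its bookkeeping by SlaveID and use a proper range type):
{code}
#include <map>
#include <string>

// Simplified stand-ins for illustration only.
struct SlaveTotal
{
  double cpus;
  double mem;
  std::string ports;  // e.g. "[31000-32000]"
};

// Keyed by slave: "identical" port ranges from two slaves remain two distinct
// entries instead of collapsing into one (the bug described above).
std::map<std::string, SlaveTotal> totals;

void addSlave(const std::string& slaveId, const SlaveTotal& total)
{
  totals[slaveId] = total;  // no cross-slave aggregation of ranges
}
{code}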
[jira] [Updated] (MESOS-2248) 0.22.0 release
[ https://issues.apache.org/jira/browse/MESOS-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2248: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) 0.22.0 release -- Key: MESOS-2248 URL: https://issues.apache.org/jira/browse/MESOS-2248 Project: Mesos Issue Type: Epic Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen Mesos release 0.22.0 will include the following major feature(s): - Module Hooks (MESOS-2060) - Disk quota isolation in Mesos containerizer (MESOS-1587 and MESOS-1588) Minor features and fixes: - Task labels (MESOS-2120) - Service discovery info for tasks and executors (MESOS-2208) - Docker containerizer able to recover when running in a container (MESOS-2115) - Containerizer fixes (...) - Various bug fixes (...) Possible major features: - Container level network isolation (MESOS-1585) - Dynamic Reservations (MESOS-2018) This ticket will be used to track blockers to this release. For reference (per Jan 22nd) this has gone into Mesos since 0.21.1: https://gist.github.com/nqn/76aeb41a555625659ed8 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1831) Master should send PingSlaveMessage instead of PING
[ https://issues.apache.org/jira/browse/MESOS-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1831: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Master should send PingSlaveMessage instead of PING - Key: MESOS-1831 URL: https://issues.apache.org/jira/browse/MESOS-1831 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Adam B Labels: mesosphere In 0.21.0 master sends PING message with an embedded PingSlaveMessage for backwards compatibility (https://reviews.apache.org/r/25867/). In 0.22.0, master should send PingSlaveMessage directly instead of PING. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2215: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks. --- Key: MESOS-2215 URL: https://issues.apache.org/jira/browse/MESOS-2215 Project: Mesos Issue Type: Bug Components: docker Affects Versions: 0.21.0 Reporter: Steve Niemitz Assignee: Timothy Chen Once the slave restarts and recovers the tasks, I see this error in the log, every second or so, for all tasks that were recovered. Note, these were NOT docker tasks: W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21 However, the tasks themselves are still healthy and running. The slave was launched with --containerizers=mesos,docker - More info: it looks like the docker containerizer is a little too ambitious about recovering containers; again, this was not a docker task: I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' of framework 20150109-161713-715350282-5050-290797- Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it should recover a given task, i.e., whether it was the one that launched it. Perhaps this needs to be written into the checkpoint somewhere? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
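A minimal sketch of the suggestion in the last paragraph (the checkpoint layout shown is assumed, not the actual Mesos format): record at launch time which containerizer owns the run, and have each containerizer skip foreign entries during recovery.
{code}
#include <string>

struct CheckpointedRun
{
  std::string containerId;
  std::string containerizer;  // e.g. "mesos" or "docker", written at launch time
};

// Each containerizer recovers only runs it launched itself, instead of,
// e.g., running "docker inspect" on every recovered container id.
bool shouldRecover(const CheckpointedRun& run, const std::string& self)
{
  return run.containerizer == self;
}
{code}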
[jira] [Updated] (MESOS-2072) Fetcher cache eviction
[ https://issues.apache.org/jira/browse/MESOS-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2072: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Fetcher cache eviction -- Key: MESOS-2072 URL: https://issues.apache.org/jira/browse/MESOS-2072 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 336h Remaining Estimate: 336h Delete files from the fetcher cache so that a given cache size is never exceeded. Succeed in doing so while concurrent downloads are on their way and new requests are pouring in. Idea: measure the size of each download before it begins, and make enough room before the download starts. This means that only download mechanisms that divulge the size before the main download will be supported. As far as we know, those in use so far have this property. The calculation of how much space to free needs to be under concurrency control, accumulating all space needed for competing, incomplete download requests. (The Python script that performs fetcher caching for Aurora does not seem to implement this. See https://gist.github.com/zmanji/f41df77510ef9d00265a; imagine several of these programs running concurrently, each one's _cache_eviction() call succeeding, each perceiving the SAME free space as being available.) Ultimately, a conflict resolution strategy is needed if just the downloads underway already exceed the cache capacity. Then, as a fallback, direct download into the work directory will be used for some tasks. TBD how to pick which task gets treated how. At first, only support copying of any downloaded files to the work directory for task execution. This isolates the task life cycle after starting a task from cache eviction considerations. (Later, we can add symbolic links that avoid copying. But then eviction of fetched files used by ongoing tasks must be blocked, which adds complexity. Another future extension is MESOS-1667, Extract from URI while downloading into work dir.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
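A standalone sketch of the space-reservation idea under one lock (assuming sizes are known up front; the eviction helper is stubbed): every admitted-but-incomplete download counts against capacity, so concurrent admissions cannot all perceive the same free space, which is exactly the flaw described above.
{code}
#include <cstdint>
#include <mutex>
#include <stdexcept>

class CacheSpace
{
public:
  explicit CacheSpace(uint64_t capacity) : capacity(capacity), reserved(0) {}

  // Called before a download begins; accumulates space for all competing,
  // incomplete downloads under one lock.
  void reserve(uint64_t size)
  {
    std::lock_guard<std::mutex> lock(mutex);
    while (reserved + size > capacity) {
      if (!evictOneIdleEntry()) {
        // Conflict resolution fallback described above: direct download
        // into the work directory instead of the cache.
        throw std::runtime_error("cache full");
      }
    }
    reserved += size;  // stays claimed until the entry is later evicted
  }

private:
  // Stub: a real version would drop an LRU entry and decrease 'reserved'.
  bool evictOneIdleEntry() { return false; }

  std::mutex mutex;
  uint64_t capacity;
  uint64_t reserved;
};
{code}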
[jira] [Updated] (MESOS-2069) Basic fetcher cache functionality
[ https://issues.apache.org/jira/browse/MESOS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2069: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Basic fetcher cache functionality - Key: MESOS-2069 URL: https://issues.apache.org/jira/browse/MESOS-2069 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Labels: fetcher, slave Original Estimate: 48h Remaining Estimate: 48h Add a flag to CommandInfo URI protobufs that indicates that files downloaded by the fetcher shall be cached in a repository. To be followed by MESOS-2057 for concurrency control. Also see MESOS-336 for the overall goals for the fetcher cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
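For illustration, setting such a flag through the C++ protobuf API might look as follows (the cache field name and the include path are assumptions; value and extract are existing URI fields):
{code}
#include <mesos/mesos.pb.h>  // assumed include path

void markCacheable(mesos::CommandInfo* command)
{
  mesos::CommandInfo::URI* uri = command->add_uris();
  uri->set_value("http://example.com/assets/archive.tgz");
  uri->set_extract(true);  // existing behavior: unpack archives
  uri->set_cache(true);    // the flag proposed here (name assumed)
}
{code}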
[jira] [Updated] (MESOS-2337) __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos
[ https://issues.apache.org/jira/browse/MESOS-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2337: -- Sprint: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 4 - 3/6) __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos -- Key: MESOS-2337 URL: https://issues.apache.org/jira/browse/MESOS-2337 Project: Mesos Issue Type: Bug Components: build, python api Reporter: Kapil Arya Assignee: Kapil Arya Priority: Blocker When doing a make install, the src/python/native/src/mesos/__init__.py file is not getting installed in ${PREFIX}/lib/pythonX.Y/site-packages/mesos/. This makes it impossible to do the following import when PYTHONPATH is set to the site-packages directory. {code} import mesos.interface.mesos_pb2 {code} The directories `${PREFIX}/lib/pythonX.Y/site-packages/mesos/{interface,native}/` do have their corresponding `__init__.py` files. Reproducing the bug: ../configure --prefix=$HOME/test-install && make install -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2070) Implement simple slave recovery behavior for fetcher cache
[ https://issues.apache.org/jira/browse/MESOS-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2070: -- Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Implement simple slave recovery behavior for fetcher cache -- Key: MESOS-2070 URL: https://issues.apache.org/jira/browse/MESOS-2070 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Labels: newbie Original Estimate: 6h Remaining Estimate: 6h Clean the fetcher cache completely upon slave restart/recovery. This implements correct, albeit not ideal behavior. More efficient schemes that restore knowledge about cached files or even resume downloads can be added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
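A minimal sketch of that behavior using stout helpers (the hook point and the directory argument are assumptions): wipe the cache directory during recovery so the slave starts from a clean slate.
{code}
#include <string>

#include <stout/error.hpp>
#include <stout/nothing.hpp>
#include <stout/os.hpp>
#include <stout/try.hpp>

Try<Nothing> recoverFetcherCache(const std::string& cacheDirectory)
{
  if (os::exists(cacheDirectory)) {
    Try<Nothing> rmdir = os::rmdir(cacheDirectory);  // recursive removal
    if (rmdir.isError()) {
      return Error("Failed to clear fetcher cache: " + rmdir.error());
    }
  }
  return Nothing();
}
{code}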
[jira] [Updated] (MESOS-2016) docker_name_prefix is too generic
[ https://issues.apache.org/jira/browse/MESOS-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2016: -- Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) docker_name_prefix is too generic - Key: MESOS-2016 URL: https://issues.apache.org/jira/browse/MESOS-2016 Project: Mesos Issue Type: Bug Reporter: Jay Buffington Assignee: Timothy Chen From docker.hpp and docker.cpp: {quote} // Prefix used to name Docker containers in order to distinguish those // created by Mesos from those created manually. extern std::string DOCKER_NAME_PREFIX; // TODO(benh): At some point to run multiple slaves we'll need to make // the Docker container name creation include the slave ID. string DOCKER_NAME_PREFIX = "mesos-"; {quote} This name is too generic. A common pattern in docker land is to run everything in a container and use volume mounts to share sockets and do RPC between containers. CoreOS has popularized this technique. Inevitably, what people do is start a container named "mesos-slave", which runs the docker containerizer recovery code, which removes all containers that start with "mesos-". And then they ask: "huh, why did my mesos-slave docker container die? I don't see any error messages..." Ideally, we should do what Ben suggested and add the slave id to the name prefix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
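A minimal sketch of Ben's suggestion (the exact naming scheme is an assumption): include the slave ID in the prefix so recovery only matches containers this slave created, not anything that merely starts with "mesos-".
{code}
#include <string>

std::string dockerNamePrefix(const std::string& slaveId)
{
  return "mesos-" + slaveId + "-";  // slave ID value illustrative
}

bool ownedByThisSlave(const std::string& containerName, const std::string& slaveId)
{
  const std::string prefix = dockerNamePrefix(slaveId);
  return containerName.compare(0, prefix.size(), prefix) == 0;
}
{code}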
[jira] [Updated] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1806: -- Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Substituting etcd or ReplicatedLog for Zookeeper Key: MESOS-1806 URL: https://issues.apache.org/jira/browse/MESOS-1806 Project: Mesos Issue Type: Task Reporter: Ed Ropple Assignee: Cody Maloney Priority: Minor adam_mesos eropple: Could you also file a new JIRA for Mesos to drop ZK in favor of etcd or ReplicatedLog? Would love to get some momentum going on that one. -- Consider it filed. =) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2050) InMemoryAuxProp plugin used by Authenticators results in SEGFAULT
[ https://issues.apache.org/jira/browse/MESOS-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2050: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) InMemoryAuxProp plugin used by Authenticators results in SEGFAULT - Key: MESOS-2050 URL: https://issues.apache.org/jira/browse/MESOS-2050 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Reporter: Vinod Kone Assignee: Till Toenshoff Observed this on ASF CI: Basically, as part of the recent Auth refactor for modules, the loading of secrets is being done once per Authenticator Process instead of once in the Master. Since the InMemoryAuxProp plugin manipulates static variables (e.g., 'properties'), it results in a SEGFAULT when one Authenticator (e.g., for the slave) does load() while another Authenticator (e.g., for a framework) does lookup(), as both these methods manipulate the static 'properties'. {code} [ RUN ] MasterTest.LaunchDuplicateOfferTest Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp' I1104 03:37:55.523553 28363 leveldb.cpp:176] Opened db in 2.270387ms I1104 03:37:55.524250 28363 leveldb.cpp:183] Compacted db in 662527ns I1104 03:37:55.524276 28363 leveldb.cpp:198] Created db iterator in 4964ns I1104 03:37:55.524284 28363 leveldb.cpp:204] Seeked to beginning of db in 702ns I1104 03:37:55.524291 28363 leveldb.cpp:273] Iterated through 0 keys in the db in 450ns I1104 03:37:55.524333 28363 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I1104 03:37:55.524852 28384 recover.cpp:437] Starting replica recovery I1104 03:37:55.525188 28384 recover.cpp:463] Replica is in EMPTY status I1104 03:37:55.526577 28378 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I1104 03:37:55.527135 28378 master.cpp:318] Master 20141104-033755-3176252227-49988-28363 (proserpina.apache.org) started on 67.195.81.189:49988 I1104 03:37:55.527180 28378 master.cpp:364] Master only allowing authenticated frameworks to register I1104 03:37:55.527191 28378 master.cpp:369] Master only allowing authenticated slaves to register I1104 03:37:55.527217 28378 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp/credentials' I1104 03:37:55.527451 28378 master.cpp:408] Authorization enabled I1104 03:37:55.528081 28384 master.cpp:126] No whitelist given. Advertising offers for all slaves I1104 03:37:55.528548 28383 recover.cpp:188] Received a recover response from a replica in EMPTY status I1104 03:37:55.528645 28388 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.189:49988 I1104 03:37:55.529233 28388 master.cpp:1258] The newly elected leader is master@67.195.81.189:49988 with id 20141104-033755-3176252227-49988-28363 I1104 03:37:55.529266 28388 master.cpp:1271] Elected as the leading master! 
I1104 03:37:55.529289 28388 master.cpp:1089] Recovering from registrar I1104 03:37:55.529311 28385 recover.cpp:554] Updating replica status to STARTING I1104 03:37:55.529500 28384 registrar.cpp:313] Recovering registrar I1104 03:37:55.530037 28383 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 497965ns I1104 03:37:55.530083 28383 replica.cpp:320] Persisted replica status to STARTING I1104 03:37:55.530335 28387 recover.cpp:463] Replica is in STARTING status I1104 03:37:55.531343 28381 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I1104 03:37:55.531739 28384 recover.cpp:188] Received a recover response from a replica in STARTING status I1104 03:37:55.532168 28379 recover.cpp:554] Updating replica status to VOTING I1104 03:37:55.532572 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 293974ns I1104 03:37:55.532594 28381 replica.cpp:320] Persisted replica status to VOTING I1104 03:37:55.532790 28390 recover.cpp:568] Successfully joined the Paxos group I1104 03:37:55.533107 28390 recover.cpp:452] Recover process terminated I1104 03:37:55.533604 28382 log.cpp:656] Attempting to start the writer I1104 03:37:55.534840 28381 replica.cpp:474] Replica received implicit promise request with proposal 1 I1104 03:37:55.535188 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 321021ns I1104 03:37:55.535212 28381 replica.cpp:342] Persisted promised to 1 I1104 03:37:55.535893 28378 coordinator.cpp:230] Coordinator attemping to fill missing position I1104 03:37:55.537318 28392 replica.cpp:375] Replica received explicit promise
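A standalone sketch of the failure mode and the obvious repair (the plugin's real state is only assumed to look roughly like this): load() and lookup() race on unsynchronized static data, and serializing access with a mutex, or loading the secrets once centrally in the Master, removes the race.
{code}
#include <map>
#include <mutex>
#include <string>

static std::mutex mutex;
static std::map<std::string, std::string> properties;  // shared static state

void load(const std::map<std::string, std::string>& secrets)
{
  std::lock_guard<std::mutex> lock(mutex);  // without this: data race / SEGV
  properties = secrets;
}

std::string lookup(const std::string& user)
{
  std::lock_guard<std::mutex> lock(mutex);
  auto it = properties.find(user);
  return it == properties.end() ? "" : it->second;
}
{code}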
[jira] [Updated] (MESOS-2085) Add support for encrypted and non-encrypted communication in parallel for cluster upgrade
[ https://issues.apache.org/jira/browse/MESOS-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2085: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add support for encrypted and non-encrypted communication in parallel for cluster upgrade - Key: MESOS-2085 URL: https://issues.apache.org/jira/browse/MESOS-2085 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere During a cluster upgrade from non-encrypted to encrypted communication, we need to support an interim state where: 1) A master can have connections to both encrypted and non-encrypted slaves. 2) A slave that supports encrypted communication connects to a master that has not yet been upgraded. 3) Frameworks are encrypted but the master has not been upgraded yet. 4) Master has been upgraded but frameworks haven't. 5) A slave process has upgraded but running executor processes haven't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353839#comment-14353839 ] Joerg Schad edited comment on MESOS-2309 at 3/9/15 11:10 PM: - Just tested and you are correct :-). was (Author: js84): Just tested and you are right :-). Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Joerg Schad Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
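The underlying protobuf behavior can be demonstrated in isolation (include path assumed; CommandInfo's shell field defaults to true): the two messages have the same effective value but differ in field presence and in serialized bytes, which is what presence- or byte-sensitive comparisons trip over.
{code}
#include <cassert>

#include <mesos/mesos.pb.h>  // assumed include path

int main()
{
  mesos::CommandInfo a;  // shell unset: a.shell() == true via the default
  mesos::CommandInfo b;
  b.set_shell(true);     // shell explicitly set to its default value

  assert(a.shell() == b.shell());                          // same effective value
  assert(a.has_shell() != b.has_shell());                  // different presence
  assert(a.SerializeAsString() != b.SerializeAsString());  // different bytes
  return 0;
}
{code}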
[jira] [Updated] (MESOS-2333) Securing Sandboxes via Filebrowser Access Control
[ https://issues.apache.org/jira/browse/MESOS-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2333: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Securing Sandboxes via Filebrowser Access Control - Key: MESOS-2333 URL: https://issues.apache.org/jira/browse/MESOS-2333 Project: Mesos Issue Type: Improvement Components: security Reporter: Adam B Assignee: Alexander Rojas Labels: authorization, filebrowser, mesosphere, security As it stands now, anybody with access to the master or slave web UI can use the filebrowser to view the contents of any attached/mounted paths on the master or slave. Currently, the attached paths include master and slave logs as well as executor/task sandboxes. While there's a chance that the master and slave logs could contain sensitive information, it's much more likely that sandboxes could contain customer data or other files that should not be globally accessible. Securing the sandboxes is the primary goal of this ticket. There are four filebrowser endpoints: browse, read, download, and debug. Here are some potential solutions. 1) We could easily provide flags that globally enable/disable each endpoint, allowing coarse-grained access control. This might be a reasonable short-term plan. We would also want to update the web UIs to display an Access Denied error, rather than showing links that open up blank pailers. 2) Each master and slave handles its own authn/authz. Slaves will need to have an authenticator, and there must be a way to provide each node with credentials and ACLs, and keep these in sync across the cluster. 3) Filter all slave communications through the master(s), which already has credentials and ACLs. We'll have to restrict access to the filebrowser (and other?) endpoints to the (leading?) master. Then the master can perform the authentication and authorization, only passing the request on to the slave if auth succeeds. 3a) The slave returns the browse/read/download response back through the master. This could be a network bottleneck. 3b) Upon authn/z success, the master redirects the request to the appropriate slave, which will send the response directly back to the requester. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
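Option (1) could be as small as a few stout-style flags. A sketch under the assumption that the flag names and the FlagsBase add() pattern shown here apply (they are illustrative, not the actual Mesos flags):
{code}
#include <stout/flags.hpp>  // assumed include path

class Flags : public flags::FlagsBase
{
public:
  Flags()
  {
    add(&Flags::enable_files_browse,
        "enable_files_browse",
        "Whether the /files/browse endpoint is served",
        true);

    add(&Flags::enable_files_download,
        "enable_files_download",
        "Whether the /files/download endpoint is served",
        true);
  }

  bool enable_files_browse;
  bool enable_files_download;
};
{code}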
[jira] [Updated] (MESOS-2139) Enable master's Accept call handler to support Dynamic Reservation
[ https://issues.apache.org/jira/browse/MESOS-2139?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2139: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Enable master's Accept call handler to support Dynamic Reservation --- Key: MESOS-2139 URL: https://issues.apache.org/jira/browse/MESOS-2139 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: mesosphere The allocated resources in the allocator need to be updated when a dynamic reservation is performed because we need to transition the {{Resources}} that are marked {{reservationType=STATIC}} to {{DYNAMIC}}. {{Resources::apply(Offer::Operation)}} is used to determine the resulting set of resources after an operation. This is to be used to update the resources in places such as the allocator and the total slave resources. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
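For illustration, the update step could look like this (variable names are illustrative and the includes are assumed; the flatten/apply calls mirror the snippets in MESOS-2373 above):
{code}
#include <string>

#include <mesos/mesos.pb.h>      // assumed include path
#include <mesos/resources.hpp>   // assumed include path

// Compute the post-operation resources and, on success, swap them into the
// bookkeeping (e.g. the allocator's view of the slave's total resources).
void applyReserve(Resources& total, const Resources& unreserved, const std::string& role)
{
  Resources reserved = unreserved.flatten(role, Resource::FRAMEWORK);

  Offer::Operation reserve;
  reserve.set_type(Offer::Operation::RESERVE);
  reserve.mutable_reserve()->mutable_resources()->CopyFrom(reserved);

  Try<Resources> updated = total.apply(reserve);
  if (updated.isSome()) {
    total = updated.get();
  }
}
{code}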
[jira] [Updated] (MESOS-2229) Add max allowed age to Slave stats.json endpoint
[ https://issues.apache.org/jira/browse/MESOS-2229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2229: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6) Add max allowed age to Slave stats.json endpoint -- Key: MESOS-2229 URL: https://issues.apache.org/jira/browse/MESOS-2229 Project: Mesos Issue Type: Improvement Components: json api Reporter: Sunil Abraham Assignee: Alexander Rojas Labels: mesosphere Currently max allowed age gets logged, but it would be great to have this in the slave's stats.json endpoint for programmatic access. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2161) AbstractState JNI check fails for Marathon framework
[ https://issues.apache.org/jira/browse/MESOS-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Swartz updated MESOS-2161: Attachment: mesos_core_dump_gdb.txt.bz2 GDB investigation of the core dump from Marathon including registers. AbstractState JNI check fails for Marathon framework Key: MESOS-2161 URL: https://issues.apache.org/jira/browse/MESOS-2161 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Environment: Mesos 0.21.0 Marathon 0.7.5 Fedora 20 Reporter: Matthew Sanders Attachments: mesos_core_dump_gdb.txt.bz2 I've recently upgraded to mesos 0.21.0 and now it seems that every few minutes or so I see the following error, which kills marathon. Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,064] INFO 10.133.128.26 - - [26/Nov/2014:00:12:41 +] GET /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 (mesosphere.chaos.http.ChaosRequestLog:15) Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,238] INFO 10.133.128.26 - - [26/Nov/2014:00:12:42 +] GET /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 (mesosphere.chaos.http.ChaosRequestLog:15) Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,961] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +] GET /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537... Nov 25 18:12:43 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:43,032] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +] GET /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari... 
Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: F1125 18:12:44.146260 5897 check.hpp:79] Check failed: f.isReady() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: *** Check failure stack trace: *** Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b17c google::LogMessage::Fail() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b0d5 google::LogMessage::SendToLog() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2aab3 google::LogMessage::Flush() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2da3b google::LogMessageFatal::~LogMessageFatal() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1ea64 _checkReady() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1d43b Java_org_apache_mesos_state_AbstractState__1_1names_1get Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f81f644ca70 (unknown) Nov 25 18:12:44 gianthornet.trading.imc.intra systemd[1]: marathon.service: main process exited, code=killed, status=6/ABRT Here's the command that mesos-master is being run with /usr/local/sbin/mesos-master --zk=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --port=5050 --log_dir=/var/log/mesos --quorum=1 --work_dir=/var/lib/mesos Here's the command that the slave is running with: /usr/local/sbin/mesos-slave --master=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --attributes=country:us;datacenter:njl3;environment:dev;region:amer;timezone:America/Chicago I realize this could also be filed to marathon, but it sort of looks like a c++ issue to me, which is why I came here to post this. Any help would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2309) Mesos rejects ExecutorInfo as incompatible when there is no functional difference
[ https://issues.apache.org/jira/browse/MESOS-2309?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2309: -- Assignee: Vinod Kone (was: Joerg Schad) Mesos rejects ExecutorInfo as incompatible when there is no functional difference - Key: MESOS-2309 URL: https://issues.apache.org/jira/browse/MESOS-2309 Project: Mesos Issue Type: Bug Reporter: Zameer Manji Assignee: Vinod Kone Priority: Minor Labels: twitter In AURORA-1076 it was discovered that if an ExecutorInfo was changed such that a previously unset optional field with a default value was changed to have the field set with the default value, it would be rejected as not compatible. For example if we have an ExecutorInfo with a CommandInfo with the {{shell}} attribute unset and then we change the CommandInfo to set the {{shell}} attribute to true Mesos will reject the task with: {noformat} I0130 21:50:05.373389 50869 master.cpp:3441] Sending status update TASK_LOST (UUID: 82ef615c-0d59-4427-95d5-80cf0e52b3fc) for task system-gc-c89c0c05-200c-462e-958a-ecd7b9a76831 of framework 201103282247-19- 'Task has invalid ExecutorInfo (existing ExecutorInfo with same ExecutorID is not compatible). {noformat} This is not intuitive because the default value of the {{shell}} attribute is true. There should be no difference between not setting an optional field with a default value and setting that field to the default value. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1921) Design and implement protobuf storage of IP addresses
[ https://issues.apache.org/jira/browse/MESOS-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353926#comment-14353926 ] Evelina Dumitrescu commented on MESOS-1921: --- I think that a better approach would be to: - Design an additional protobuf message inside the MasterInfo message message IPv6 { required uint32 s6_addr1 = 1; required uint32 s6_addr2 = 2; required uint32 s6_addr3 = 3; required uint32 s6_addr4 = 4; } - Add an optional field of IPv6 type inside the MasterInfo - An additional field for the IPv6 hostname will be needed in the SlaveInfo, Offer, ContainerInfo protobuf messages. The resolved hostnames from the IPv4 and IPv6 addresses might differ. (Moreover, if the hostname cannot be resolved then a string version of the IP address will be returned.) Design and implement protobuf storage of IP addresses - Key: MESOS-1921 URL: https://issues.apache.org/jira/browse/MESOS-1921 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Evelina Dumitrescu We can use the {{bytes}} type or statements like {{repeated uint32 data = 4 [packed=true];}} {{string}} representations might again add some parsing overhead. An additional field might be necessary to specify the protocol family type (to distinguish between IPv4/IPv6). For example, if we don't specify the family type we can't distinguish between these IP addresses in the case of byte/array representation: 0:0:0:0:0:0:IPV4 and IPv4 (see http://tools.ietf.org/html/rfc4291#page-10) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2470) Create state abstraction stress test
Niklas Quarfot Nielsen created MESOS-2470: - Summary: Create state abstraction stress test Key: MESOS-2470 URL: https://issues.apache.org/jira/browse/MESOS-2470 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Due to https://issues.apache.org/jira/browse/MESOS-2161, we need a way to stress test the state abstraction and show its scalability properties. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2467) Allow --resources flag to take JSON.
[ https://issues.apache.org/jira/browse/MESOS-2467?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353874#comment-14353874 ] Michael Park commented on MESOS-2467: - I'm surprised we only support {{JSON::Object}} currently, but I guess we haven't needed the other ones. It looks like the other flags have the pattern of: "See the X Protobuf in mesos.proto for the expected format." It seems that in order to keep consistency we would need a {{Resources}} message: {code}message Resources { repeated Resource resources = 1; }{code} which of course would break badly since we already have a {{Resources}} C++ type. Unless there's a sane way to separate those, it totally makes sense to support a JSON array of Resource objects instead, and we can say something like: "See the Resource protobuf in mesos.proto for the expected format of each of the elements." Allow --resources flag to take JSON. Key: MESOS-2467 URL: https://issues.apache.org/jira/browse/MESOS-2467 Project: Mesos Issue Type: Improvement Reporter: Jie Yu Currently, we use a customized format for the --resources flag. As we introduce more and more stuff (e.g., persistence, reservation) into the Resource object, we need a more generic way to specify --resources. For backward compatibility, we can scan the first character: if it is '[', then we invoke the JSON parser; otherwise, we use the existing parser. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
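A sketch of the dispatch proposed in the description (JSON::parse and Resources::parse are existing stout/Mesos calls; the JSON-to-Resources conversion helper and the include paths are assumptions): scan the first character and route to the JSON parser only for '['.
{code}
#include <string>

#include <stout/error.hpp>
#include <stout/json.hpp>
#include <stout/try.hpp>

#include <mesos/resources.hpp>  // assumed include path

using mesos::Resources;

// Hypothetical conversion helper; each array element is expected to follow
// the Resource protobuf in mesos.proto.
Try<Resources> fromJSON(const JSON::Array& array);

Try<Resources> parseResourcesFlag(const std::string& value)
{
  if (!value.empty() && value[0] == '[') {
    Try<JSON::Array> json = JSON::parse<JSON::Array>(value);
    if (json.isError()) {
      return Error("Invalid JSON: " + json.error());
    }
    return fromJSON(json.get());
  }
  return Resources::parse(value, "*");  // existing format, e.g. "cpus:2;mem:512"
}
{code}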
[jira] [Commented] (MESOS-2469) Mesos master/slave should be able to bind to 127.0.0.1 if explicitly requested
[ https://issues.apache.org/jira/browse/MESOS-2469?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353894#comment-14353894 ] Vinod Kone commented on MESOS-2469: --- https://reviews.apache.org/r/31872/ Mesos master/slave should be able to bind to 127.0.0.1 if explicitly requested -- Key: MESOS-2469 URL: https://issues.apache.org/jira/browse/MESOS-2469 Project: Mesos Issue Type: Bug Reporter: Vinod Kone Assignee: Vinod Kone With the current refactoring to IP it looks like master and slave can no longer bind to 127.0.0.1 even if explicitly requested via --ip flag. Among other things, this breaks the balloon framework test which uses this flag. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2427) Add Java binding for the acceptOffers API.
[ https://issues.apache.org/jira/browse/MESOS-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353934#comment-14353934 ] Jie Yu commented on MESOS-2427: --- https://reviews.apache.org/r/31873/ Add Java binding for the acceptOffers API. -- Key: MESOS-2427 URL: https://issues.apache.org/jira/browse/MESOS-2427 Project: Mesos Issue Type: Task Components: java api Reporter: Jie Yu Assignee: Jie Yu We introduced the new acceptOffers API in the C++ driver. We need to provide a Java binding for this API as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2427) Add Java binding for the acceptOffers API.
[ https://issues.apache.org/jira/browse/MESOS-2427?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-2427: - Assignee: Jie Yu Add Java binding for the acceptOffers API. -- Key: MESOS-2427 URL: https://issues.apache.org/jira/browse/MESOS-2427 Project: Mesos Issue Type: Task Components: java api Reporter: Jie Yu Assignee: Jie Yu We introduced the new acceptOffers API in the C++ driver. We need to provide a Java binding for this API as well. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-1921) Design and implement protobuf storage of IP addresses
[ https://issues.apache.org/jira/browse/MESOS-1921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14353926#comment-14353926 ] Evelina Dumitrescu edited comment on MESOS-1921 at 3/10/15 12:06 AM: - I think that a better approach would be to: - Design an additional protobuf message inside the MasterInfo message {noformat} message IPv6 { required uint32 s6_addr1 = 1; required uint32 s6_addr2 = 2; required uint32 s6_addr3 = 3; required uint32 s6_addr4 = 4; } {noformat} - Add an optional field of IPv6 type inside the MasterInfo - An additional field for the IPv6 hostname will be needed in the SlaveInfo, Offer, ContainerInfo protobuf messages. The resolved hostnames from the IPv4 and IPv6 addresses might differ. (Moreover, if the hostname cannot be resolved then a string version of the IP address will be returned.) was (Author: evelinad): I think that a better approach would be to: - Design an additional protobuf message inside the MasterInfo message message IPv6 { required uint32 s6_addr1 = 1; required uint32 s6_addr2 = 2; required uint32 s6_addr3 = 3; required uint32 s6_addr4 = 4; } - Add an optional field of IPv6 type inside the MasterInfo - An additional field for the IPv6 hostname will be needed in the SlaveInfo, Offer, ContainerInfo protobuf messages. The resolved hostnames from the IPv4 and IPv6 addresses might differ. (Moreover, if the hostname cannot be resolved then a string version of the IP address will be returned.) Design and implement protobuf storage of IP addresses - Key: MESOS-1921 URL: https://issues.apache.org/jira/browse/MESOS-1921 Project: Mesos Issue Type: Task Reporter: Dominic Hamon Assignee: Evelina Dumitrescu We can use the {{bytes}} type or statements like {{repeated uint32 data = 4 [packed=true];}} {{string}} representations might again add some parsing overhead. An additional field might be necessary to specify the protocol family type (to distinguish between IPv4/IPv6). For example, if we don't specify the family type we can't distinguish between these IP addresses in the case of byte/array representation: 0:0:0:0:0:0:IPV4 and IPv4 (see http://tools.ietf.org/html/rfc4291#page-10) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2466) Write documentation for all the LIBPROCESS_* environment variables.
Alexander Rojas created MESOS-2466: -- Summary: Write documentation for all the LIBPROCESS_* environment variables. Key: MESOS-2466 URL: https://issues.apache.org/jira/browse/MESOS-2466 Project: Mesos Issue Type: Documentation Reporter: Alexander Rojas libprocess uses a set of environment variables to modify its behaviour; however, these variables are not documented anywhere, nor is it defined where the documentation should live. What is needed is a decision on where the environment variables should be documented (a new doc file or an existing one), and then to add the documentation there. After searching in the code, these are the variables which need to be documented: # {{LIBPROCESS_ENABLE_PROFILER}} # {{LIBPROCESS_IP}} # {{LIBPROCESS_PORT}} # {{LIBPROCESS_STATISTICS_WINDOW}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2451) mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event
[ https://issues.apache.org/jira/browse/MESOS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] craig bordelon updated MESOS-2451: -- Attachment: bug0.cpp mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event - Key: MESOS-2451 URL: https://issues.apache.org/jira/browse/MESOS-2451 Project: Mesos Issue Type: Bug Components: c++ api Affects Versions: 0.22.0 Environment: red hat linux 6.5 Reporter: craig bordelon Assignee: Benjamin Hindman Attachments: Makefile, bug.cpp, bug0.cpp, log.h We've observed that the mesos 0.22.0-rc1 c++ zookeeper code appears to hang (two threads stuck in indefinite pthread condition waits) on a test case that, as best we can tell, is a mesos issue and not an issue with the underlying apache zookeeper C binding. (That is, we tried the same type of case using the apache zookeeper C binding directly and saw no issues.) This happens with a properly running zookeeper (standalone is sufficient). Here's how we hung it: We issue a mesos zk set via int ZooKeeper::set(const std::string& path, const std::string& data, int version) then inside a Watcher we process the CHANGED event to issue a mesos zk get on the same path via int ZooKeeper::get(const std::string& path, bool watch, std::string* result, Stat* stat) and we end up with two threads in the process, both in pthread_cond_wait: #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab1aac in ZooKeeper::set (this=0x803ce0, path=/craig/mo, data= ... and #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6638000d00, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path=/craig/mo, watch=false, We of course have a separate enhancement suggestion that the mesos C++ zookeeper api use timed waits and not block indefinitely for responses. But in this case we think the mesos code itself is blocking on itself and not handling the responses. craig -- This message was sent by Atlassian JIRA (v6.3.4#6332)
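The reproducing pattern boils down to the following shape (a sketch only: the Watcher interface follows the signatures quoted above, and the wiring is illustrative). A blocking get() issued from inside the watcher callback waits on machinery that the callback itself is holding up.
{code}
#include <string>

// Sketch: issuing a blocking ZooKeeper::get() from within the watcher that
// fires on the CHANGED event for the same node.
class ChangedWatcher : public Watcher
{
public:
  ChangedWatcher() : zk(NULL) {}

  virtual void process(int type, int state, int64_t sessionId, const std::string& path)
  {
    if (type == ZOO_CHANGED_EVENT) {
      std::string result;
      zk->get(path, false, &result, NULL);  // blocks: both threads end up in pthread_cond_wait
    }
  }

  ZooKeeper* zk;  // assigned after the ZooKeeper object is constructed
};
{code}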
[jira] [Commented] (MESOS-2451) mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event
[ https://issues.apache.org/jira/browse/MESOS-2451?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14354009#comment-14354009 ] craig bordelon commented on MESOS-2451: --- Oh oh. This is not looking like a mesos c++ zookeeper api issue after all. I now attach the test case rewritten to use just the zk c binding api; it hangs just the same. So it looks like I was wrong, when I created this case, to claim that the underlying zookeeper was fine with this type of processing. A question for anybody more familiar with the apache zookeeper c binding api: just why can't a watcher handler on CHANGED perform an asynchronous aget() on the same node? mesos c++ zookeeper code hangs from api operation from within watcher of CHANGE event - Key: MESOS-2451 URL: https://issues.apache.org/jira/browse/MESOS-2451 Project: Mesos Issue Type: Bug Components: c++ api Affects Versions: 0.22.0 Environment: red hat linux 6.5 Reporter: craig bordelon Assignee: Benjamin Hindman Attachments: Makefile, bug.cpp, log.h We've observed that the mesos 0.22.0-rc1 c++ zookeeper code appears to hang (two threads stuck in indefinite pthread condition waits) on a test case that, as best we can tell, is a mesos issue and not an issue with the underlying apache zookeeper C binding. (That is, we tried the same type of case using the apache zookeeper C binding directly and saw no issues.) This happens with a properly running zookeeper (standalone is sufficient). Here's how we hung it: We issue a mesos zk set via int ZooKeeper::set(const std::string& path, const std::string& data, int version) then inside a Watcher we process the CHANGED event to issue a mesos zk get on the same path via int ZooKeeper::get(const std::string& path, bool watch, std::string* result, Stat* stat) and we end up with two threads in the process, both in pthread_cond_wait: #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f6140, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6ba0, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7fffa0fd9040, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7fffa0fd9040) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab1aac in ZooKeeper::set (this=0x803ce0, path=/craig/mo, data= ... and #0 0x00334e20b43c in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x7f6664ee1cf5 in Gate::arrive (this=0x7f66380013f0, old=0) at ../../../3rdparty/libprocess/src/gate.hpp:82 #2 0x7f6664ecef6e in process::ProcessManager::wait (this=0x7f02e0, pid=...) at ../../../3rdparty/libprocess/src/process.cpp:2476 #3 0x7f6664ed2ce9 in process::wait (pid=..., duration=...) at ../../../3rdparty/libprocess/src/process.cpp:2958 #4 0x7f6664e90558 in process::Latch::await (this=0x7f6638000d00, duration=...) at ../../../3rdparty/libprocess/src/latch.cpp:49 #5 0x7f66649452cc in process::Future<int>::await (this=0x7f66595fb6f0, duration=...) at ../../3rdparty/libprocess/include/process/future.hpp:1156 #6 0x7f666493a04d in process::Future<int>::get (this=0x7f66595fb6f0) at ../../3rdparty/libprocess/include/process/future.hpp:1167 #7 0x7f6664ab18d3 in ZooKeeper::get (this=0x803ce0, path=/craig/mo, watch=false, We of course have a separate enhancement suggestion that the mesos C++ zookeeper api use timed waits and not block indefinitely for responses. But in this case we think the mesos code itself is blocking on itself and not handling the responses. craig -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2205) Add user documentation for Dynamic Reservation
[ https://issues.apache.org/jira/browse/MESOS-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2205: Sprint: Mesosphere Q1 Sprint 5 - 3/20 Add user documentation for Dynamic Reservation -- Key: MESOS-2205 URL: https://issues.apache.org/jira/browse/MESOS-2205 Project: Mesos Issue Type: Documentation Components: framework Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Add a document for dynamic reservations. Topics include motivation, use cases, correct usage, valid states and transitions of the (role, reservation_type) pair. -- This message was sent by Atlassian JIRA (v6.3.4#6332)