[jira] [Commented] (MESOS-2171) Compilation error on Ubuntu 12.04
[ https://issues.apache.org/jira/browse/MESOS-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370833#comment-14370833 ] haosdent commented on MESOS-2171: - Sorry, I just found the status of this issue is Unresolved. Compilation error on Ubuntu 12.04 -- Key: MESOS-2171 URL: https://issues.apache.org/jira/browse/MESOS-2171 Project: Mesos Issue Type: Bug Components: build Affects Versions: 0.22.0 Environment: Ubuntu 12.04 java version 1.6.0_33 gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) Reporter: Mark Luntzel Priority: Blocker Following http://mesos.apache.org/gettingstarted/ we get a compilation error: ../../../3rdparty/libprocess/src/clock.cpp:167:36: instantiated from here /usr/include/c++/4.6/tr1/functional:2040:46: error: invalid initialization of reference of type 'std::list<process::Timer>' from expression of type 'std::list<process::Timer>' /usr/include/c++/4.6/tr1/functional:2040:46: error: return-statement with a value, in function returning 'void' [-fpermissive] make[4]: *** [libprocess_la-clock.lo] Error 1 More output here: https://gist.github.com/luntzel/f5a3c62297aae812c986 Please advise. Thanks in advance! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370808#comment-14370808 ] Ian Downes commented on MESOS-2367: --- My opinion is that we should not change the contract between the launcher and the isolators: Isolator::cleanup will only be called when all processes in the container have been terminated. Why?
1. The single launcher is responsible for container process lifetime; multiple isolators are responsible for isolating those processes.
2. Many isolators cannot complete cleanup until all processes are destroyed - in which case they're all trying to do the same thing the launcher is doing, or what else could the isolators do differently?
3. Isolators are ordered arbitrarily and called concurrently, so there's no way to ensure, for example, that the cpu isolator is called first.
My suggestion is that we do the following:
1. Make orphan cleanup failures non-fatal so the slave will start and we gain control over running tasks.
2. Add a counter for the number of containers that failed to be destroyed, separately counting those that fail on normal destroy and those orphans that fail to be destroyed. Operators can monitor these counters and act appropriately.
3. Add code to the launcher destroy path to handle the case described here, where there are processes terminating (unmapping pages) but not making progress because of the very low cpu quota (the minimum is 0.01). If cgroup::destroy() timed out, it would examine the process's cgroup (/proc/\[pid\]/cgroup), increase the cpu quota to something like 0.5 or 1.0 cpu, and try again. This is a workaround and it goes around the cpu isolator, but I don't see a cleaner way to do it.
The case that I triaged had a JVM process with 16 GB of anonymous pages to unmap, and it took around 16 seconds once the cpu quota was increased. I expect one or two additional attempts at terminating the processes and cgroup::destroy to be successful in all but the most extreme cases of this scenario. Regardless of success (there are other potential failure modes), (1) and (2) would enable the slave to come back up and to alert operators.
Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap:
{panel:title=Quote From [~jieyu]}
{quote}
1) User tries to kill an instance
2) Slave sends {{KillTaskMessage}} to executor
3) Executor sends kill signals to task processes
4) Executor sends {{TASK_KILLED}} to slave
5) Slave updates container cpu limit to be 0.01 cpus
6) A user-process is still processing the kill signal
7) The task process cannot exit since it has too little cpu share and is throttled
8) Executor itself terminates
9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path.
10) Slave restarts, and is constantly flapping because it cannot kill orphan containers
{quote}
{panel}
The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
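The quota workaround described in point 3 of the comment above can be sketched in a few lines. This is only an illustration of the idea, not the actual launcher code; the cgroup path handling, quota value, and retry structure are assumptions.

{code}
#include <fstream>
#include <string>

// Raise the CFS quota for a cgroup so that processes which are exiting but
// throttled (e.g. unmapping many GB of anonymous pages at 0.01 cpus) can
// make progress. 100000us quota with a 100000us period == 1 full cpu.
bool relaxCpuQuota(const std::string& cgroupPath, long quotaUs)
{
  std::ofstream quota(cgroupPath + "/cpu.cfs_quota_us");
  if (!quota.is_open()) {
    return false;
  }
  quota << quotaUs;
  return quota.good();
}

// Hypothetical retry wrapper around an existing destroy routine that
// returns true on success and false on timeout.
bool destroyWithQuotaFallback(const std::string& cgroupPath,
                              bool (*destroy)(const std::string&))
{
  if (destroy(cgroupPath)) {
    return true;
  }

  // Destroy timed out: give the dying processes ~1 cpu and try again.
  relaxCpuQuota(cgroupPath, 100000);
  return destroy(cgroupPath);
}
{code}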
[jira] [Commented] (MESOS-2438) Improve support for streaming HTTP Responses in libprocess.
[ https://issues.apache.org/jira/browse/MESOS-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14370826#comment-14370826 ] Benjamin Mahler commented on MESOS-2438: Server-side sending streaming responses: {noformat} commit e76954abb37a30da5bb211829d7033e53d830a7f Author: Benjamin Mahler benjamin.mah...@gmail.com Date: Thu Mar 5 18:33:28 2015 -0800 Introduced an http::Pipe abstraction to simplify streaming HTTP Responses. Review: https://reviews.apache.org/r/31930 {noformat} Patches for making streaming http::get / http::post requests are coming soon. Improve support for streaming HTTP Responses in libprocess. --- Key: MESOS-2438 URL: https://issues.apache.org/jira/browse/MESOS-2438 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: twitter Currently libprocess' HTTP::Response supports a PIPE construct for doing streaming responses: {code} struct Response { ... // Either provide a body, an absolute path to a file, or a // pipe for streaming a response. Distinguish between the cases // using 'type' below. // // BODY: Uses 'body' as the body of the response. These may be // encoded using gzip for efficiency, if 'Content-Encoding' is not // already specified. // // PATH: Attempts to perform a 'sendfile' operation on the file // found at 'path'. // // PIPE: Splices data from 'pipe' using 'Transfer-Encoding=chunked'. // Note that the read end of the pipe will be closed by libprocess // either after the write end has been closed or if the socket the // data is being spliced to has been closed (i.e., nobody is // listening any longer). This can cause writes to the pipe to // generate a SIGPIPE (which will terminate your program unless you // explicitly ignore them or handle them). // // In all cases (BODY, PATH, PIPE), you are expected to properly // specify the 'Content-Type' header, but the 'Content-Length' and // or 'Transfer-Encoding' headers will be filled in for you. enum { NONE, BODY, PATH, PIPE } type; ... }; {code} This interface is too low level and difficult to program against: * Connection closure is signaled with SIGPIPE, which is difficult for callers to deal with (must suppress SIGPIPE locally or globally in order to get EPIPE instead). * Pipes are generally for inter-process communication, and the pipe has finite size. With a blocking pipe the caller must deal with blocking when the pipe's buffer limit is exceeded. With a non-blocking pipe, the caller must deal with retrying the write. We'll want to consider a few use cases: # Sending an HTTP::Response with streaming data. # Making a request with http::get and http::post in which the data is returned in a streaming manner. # Making a request in which the request content is streaming. This ticket will focus on 1 as it is required for the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
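For reference, a rough sketch of what serving a streaming response with the new http::Pipe could look like, based only on the commit message above; the exact member names (reader(), writer(), Response::reader) are assumptions rather than the verified libprocess API.

{code}
#include <process/http.hpp>

#include <string>

using namespace process;

http::Response streamGreeting()
{
  http::Pipe pipe;

  http::OK response;
  response.type = http::Response::PIPE;  // Sent with Transfer-Encoding: chunked.
  response.reader = pipe.reader();       // libprocess splices data from here.

  // Writer side: each write() becomes a chunk; close() ends the response.
  http::Pipe::Writer writer = pipe.writer();
  writer.write("hello ");
  writer.write("world");
  writer.close();

  return response;
}
{code}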
[jira] [Updated] (MESOS-2522) Add reason field for framework errors
[ https://issues.apache.org/jira/browse/MESOS-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Adam B updated MESOS-2522: -- Description: Currently, the only insight into framework errors is a message string. Framework schedulers could probably be smarter about how to handle errors if the cause is known. Since there are only a handful of distinct cases that could trigger an error, they could be captured by an enumeration. One specific use case for this feature follows. Frameworks that intend to survive failover typically persist the FrameworkID somewhere. When a framework has been marked completed by the master for exceeding its configured failover timeout, then re-registration triggers a framework error. Probably, the scheduler wants to disambiguate this kind of framework error from others in order to invalidate the stashed FrameworkID for the next attempt at (re)registration. was: Currently, the only insight into framework errors is a message string. Framework schedulers could probably be smarter about how to handle errors if the cause is known. Since there are only a handful of distinct cases that could trigger an error, they could be captured by an enumeration. One specific use case for this feature follows. Frameworks that intend to survive failover typically persist the FrameworkID somewhere. When a framework has been marked completed by the master for exceeding its configured failover timeout, then re-registration triggers a framework error. Probably, the scheduler wants to disambiguate this kind of framework error from others in order to invalidate the stashed FrameworkID for the next attempt at (re)gistration. Add reason field for framework errors - Key: MESOS-2522 URL: https://issues.apache.org/jira/browse/MESOS-2522 Project: Mesos Issue Type: Improvement Components: master Affects Versions: 0.22.0 Reporter: Connor Doyle Priority: Minor Labels: mesosphere Currently, the only insight into framework errors is a message string. Framework schedulers could probably be smarter about how to handle errors if the cause is known. Since there are only a handful of distinct cases that could trigger an error, they could be captured by an enumeration. One specific use case for this feature follows. Frameworks that intend to survive failover typically persist the FrameworkID somewhere. When a framework has been marked completed by the master for exceeding its configured failover timeout, then re-registration triggers a framework error. Probably, the scheduler wants to disambiguate this kind of framework error from others in order to invalidate the stashed FrameworkID for the next attempt at (re)registration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
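To make the use case concrete, here is a small illustrative sketch of how a scheduler could react to such a reason. Neither the reason enumeration nor the extra callback argument exist in Mesos; all names below are hypothetical.

{code}
#include <string>

// Hypothetical enumeration of framework error causes.
enum class FrameworkErrorReason
{
  UNKNOWN,
  FRAMEWORK_COMPLETED,   // Re-registered after exceeding the failover timeout.
  NOT_AUTHORIZED
};

class MyScheduler
{
public:
  // Hypothetical error callback carrying a reason alongside the message.
  void error(const std::string& message, FrameworkErrorReason reason)
  {
    if (reason == FrameworkErrorReason::FRAMEWORK_COMPLETED) {
      // The master considers this framework completed: drop the persisted
      // FrameworkID so the next (re)registration starts fresh.
      clearStashedFrameworkId();
      return;
    }

    // Otherwise fall back to generic handling based on the message string.
  }

private:
  void clearStashedFrameworkId() { /* remove the checkpointed ID */ }
};
{code}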
[jira] [Assigned] (MESOS-2368) Provide a backchannel for information to the framework
[ https://issues.apache.org/jira/browse/MESOS-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Timothy Chen reassigned MESOS-2368: --- Assignee: Timothy Chen (was: Henning Schmiedehausen) Provide a backchannel for information to the framework -- Key: MESOS-2368 URL: https://issues.apache.org/jira/browse/MESOS-2368 Project: Mesos Issue Type: Improvement Components: containerization, docker Reporter: Henning Schmiedehausen Assignee: Timothy Chen So that description is not very verbose. Here is my use case: In our usage of Mesos and Docker, we assign IPs when the container starts up. We cannot allocate the IP ahead of time, but we must rely on docker to give our containers their IP. This IP can be examined through docker inspect. We added code to the docker containerizer that will pick up this information and add it to an optional protobuf struct in the TaskStatus message. Therefore, when the executor and slave report a task as running, the corresponding message will contain information about the IP address that the container was assigned by docker, and we can pick up this information in our orchestration framework, e.g. to drive our load balancers. There was no good way to do that in stock Mesos, so we built that back channel. However, having a generic channel (not one for four pieces of arbitrary information) from the executor to a framework may be a good thing in general. Clearly, this information could be transferred out of band, but having it in the standard Mesos communication protocol turned out to be very elegant. To turn our current, hacked, prototype into something useful, this is what I am thinking:
- TaskStatus gains a new, optional field:
  - optional TaskContext task_context = 11; (better name suggestions very welcome)
- TaskContext has optional fields:
  - optional ContainerizerContext containerizer_context = 1;
  - optional ExecutorContext executor_context = 2;
Each executor and containerizer can add information to the TaskContext, which in turn is exposed in TaskStatus. To avoid crowding of the various fields, I want to experiment with the nested extensions as described here: http://www.indelible.org/ink/protobuf-polymorphism/ At the end of the day, the goal is that any piece that is involved in executing code on the slave side can send information back to the framework along with TaskStatus messages. Any of these fields should be optional to be backwards compatible, and they should (same as any other messages back) be considered best effort, but it will allow an effective way to communicate execution environment state back to the framework and allow the framework to react to it. I am planning to work on this and present a cleaned up version of our prototype in a bit. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
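A minimal sketch of the proposed shape, using plain C++ structs as stand-ins for protobuf messages that do not exist yet; all field names follow the proposal above and are hypothetical.

{code}
#include <optional>
#include <string>

// Stand-ins for the proposed protobuf messages.
struct ContainerizerContext
{
  std::optional<std::string> assigned_ip;   // e.g. taken from `docker inspect`.
};

struct ExecutorContext
{
  std::optional<std::string> note;
};

struct TaskContext
{
  std::optional<ContainerizerContext> containerizer_context;
  std::optional<ExecutorContext> executor_context;
};

// TaskStatus would gain: optional TaskContext task_context = 11;
struct TaskStatusSketch
{
  std::string task_id;
  std::optional<TaskContext> task_context;
};

// The docker containerizer could attach the container's IP before the
// TASK_RUNNING update is forwarded; the framework reads it on the other end.
TaskStatusSketch running(const std::string& taskId, const std::string& ip)
{
  TaskStatusSketch status;
  status.task_id = taskId;
  status.task_context = TaskContext{ContainerizerContext{ip}, std::nullopt};
  return status;
}
{code}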
[jira] [Commented] (MESOS-2508) Slave recovering a docker container results in Unknown container error
[ https://issues.apache.org/jira/browse/MESOS-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370870#comment-14370870 ] Timothy Chen commented on MESOS-2508: - Hi there, I think this is a known issue. Basically, each containerizer attempts to recover all the tasks, even those it didn't launch itself, so in this case Docker is trying to recover Mesos containerizer tasks. Slave recovering a docker container results in Unknown container error --- Key: MESOS-2508 URL: https://issues.apache.org/jira/browse/MESOS-2508 Project: Mesos Issue Type: Bug Components: containerization, docker, slave Affects Versions: 0.21.1 Environment: Ubuntu 14.04.2 LTS Docker 1.5.0 (same error with 1.4.1) Mesos 0.21.1 installed from mesosphere ubuntu repo Marathon 0.8.0 installed from mesosphere ubuntu repo Reporter: Geoffroy Jabouley I'm seeing some error logs occurring during a slave recovery of a Mesos task running in a docker container. It does not impede the slave recovery process, as the mesos task is still active and running on the slave after the recovery. But there is something not working properly when the slave is recovering my docker container. The slave detects my container as an Unknown container.
Cluster status:
- 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host.
- checkpointing is activated on both slave and framework
- use native docker containerizer
- 1 mesos task, started using marathon, is running inside a docker container and is monitored by the mesos-slave
Action:
- restart the mesos-slave process (sudo restart mesos-slave)
Expected:
- docker container still running
- mesos task still running
- no error in the mesos slave log regarding recovery process
Seen:
- docker container still running
- mesos task still running
- {color:red}Several errors *Unknown container* in the mesos slave log during recovery process{color}
--- For what it's worth, here are my investigations: 1) The mesos task starts fine in the docker container *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. Docker container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad* by Mesos docker containerizer _I guess_... 
{code} I0317 09:56:14.300439 2784 slave.cpp:1083] Got assigned task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860- I0317 09:56:14.380702 2784 slave.cpp:1193] Launching task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860- I0317 09:56:14.384466 2784 slave.cpp:3997] Launching executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860- in work directory '/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad' I0317 09:56:14.390207 2784 slave.cpp:1316] Queuing task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework '20150311-150951-3982541578-5050-50860- I0317 09:56:14.421787 2782 docker.cpp:927] Starting container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework '20150311-150951-3982541578-5050-50860-' I0317 09:56:15.784143 2781 docker.cpp:633] Checkpointing pid 27080 to '/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid' I0317 09:56:15.789443 2784 slave.cpp:2840] Monitoring executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework '20150311-150951-3982541578-5050-50860-' in container 'adb71dc4-c07d-42a9-8fed-264c241668ad' I0317 09:56:15.862642 2784 slave.cpp:1860] Got registration for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860- from executor(1)@10.195.96.237:36021 I0317 09:56:15.865319 2784 slave.cpp:1979] Flushing queued task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860- I0317 09:56:15.885414 2787 slave.cpp:2215] Handling status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task
[jira] [Commented] (MESOS-2522) Add reason field for framework errors
[ https://issues.apache.org/jira/browse/MESOS-2522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14370894#comment-14370894 ] Adam B commented on MESOS-2522: --- This request is very much in line with the TaskStatus Reason field recently added in https://issues.apache.org/jira/browse/MESOS-343 Add reason field for framework errors - Key: MESOS-2522 URL: https://issues.apache.org/jira/browse/MESOS-2522 Project: Mesos Issue Type: Improvement Components: master Affects Versions: 0.22.0 Reporter: Connor Doyle Priority: Minor Labels: mesosphere Currently, the only insight into framework errors is a message string. Framework schedulers could probably be smarter about how to handle errors if the cause is known. Since there are only a handful of distinct cases that could trigger an error, they could be captured by an enumeration. One specific use case for this feature follows. Frameworks that intend to survive failover typically persist the FrameworkID somewhere. When a framework has been marked completed by the master for exceeding its configured failover timeout, then re-registration triggers a framework error. Probably, the scheduler wants to disambiguate this kind of framework error from others in order to invalidate the stashed FrameworkID for the next attempt at (re)registration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2491) Persist the reservation state on the slave
[ https://issues.apache.org/jira/browse/MESOS-2491?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2491: Sprint: Mesosphere Q1 Sprint 6 - 4/3 Persist the reservation state on the slave -- Key: MESOS-2491 URL: https://issues.apache.org/jira/browse/MESOS-2491 Project: Mesos Issue Type: Task Components: master, slave Reporter: Michael Park Assignee: Michael Park Labels: mesosphere h3. Goal The goal for this task is to persist the reservation state stored on the master on the corresponding slave. The {{needCheckpointing}} predicate is used to capture the condition for which a resource needs to be checkpointed. Currently the only condition is {{isPersistentVolume}}. We'll update this to include dynamically reserved resources. h3. Expected Outcome * The dynamically reserved resources will be persisted on the slave. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
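A minimal sketch of the updated predicate described above; isDynamicallyReserved() is an assumed helper name, and the actual Mesos helpers may be named or placed differently.

{code}
#include <mesos/resources.hpp>

using mesos::Resource;
using mesos::Resources;

// Previously only persistent volumes were checkpointed on the slave;
// dynamically reserved resources should now survive slave restarts too.
// isDynamicallyReserved() is an assumed helper name.
static bool needCheckpointing(const Resource& resource)
{
  return Resources::isPersistentVolume(resource) ||
         Resources::isDynamicallyReserved(resource);
}
{code}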
[jira] [Commented] (MESOS-2438) Improve support for streaming HTTP Responses in libprocess.
[ https://issues.apache.org/jira/browse/MESOS-2438?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372339#comment-14372339 ] Benjamin Mahler commented on MESOS-2438: Did some cleanups along the way, these are the key reviews in the chain: https://reviews.apache.org/r/32346/ https://reviews.apache.org/r/32347/ https://reviews.apache.org/r/32351/ Improve support for streaming HTTP Responses in libprocess. --- Key: MESOS-2438 URL: https://issues.apache.org/jira/browse/MESOS-2438 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Benjamin Mahler Assignee: Benjamin Mahler Labels: twitter Currently libprocess' HTTP::Response supports a PIPE construct for doing streaming responses: {code} struct Response { ... // Either provide a body, an absolute path to a file, or a // pipe for streaming a response. Distinguish between the cases // using 'type' below. // // BODY: Uses 'body' as the body of the response. These may be // encoded using gzip for efficiency, if 'Content-Encoding' is not // already specified. // // PATH: Attempts to perform a 'sendfile' operation on the file // found at 'path'. // // PIPE: Splices data from 'pipe' using 'Transfer-Encoding=chunked'. // Note that the read end of the pipe will be closed by libprocess // either after the write end has been closed or if the socket the // data is being spliced to has been closed (i.e., nobody is // listening any longer). This can cause writes to the pipe to // generate a SIGPIPE (which will terminate your program unless you // explicitly ignore them or handle them). // // In all cases (BODY, PATH, PIPE), you are expected to properly // specify the 'Content-Type' header, but the 'Content-Length' and // or 'Transfer-Encoding' headers will be filled in for you. enum { NONE, BODY, PATH, PIPE } type; ... }; {code} This interface is too low level and difficult to program against: * Connection closure is signaled with SIGPIPE, which is difficult for callers to deal with (must suppress SIGPIPE locally or globally in order to get EPIPE instead). * Pipes are generally for inter-process communication, and the pipe has finite size. With a blocking pipe the caller must deal with blocking when the pipe's buffer limit is exceeded. With a non-blocking pipe, the caller must deal with retrying the write. We'll want to consider a few use cases: # Sending an HTTP::Response with streaming data. # Making a request with http::get and http::post in which the data is returned in a streaming manner. # Making a request in which the request content is streaming. This ticket will focus on 1 as it is required for the HTTP API. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2528) Symlink the namespace handle with ContainerID for the port mapping isolator.
Jie Yu created MESOS-2528: - Summary: Symlink the namespace handle with ContainerID for the port mapping isolator. Key: MESOS-2528 URL: https://issues.apache.org/jira/browse/MESOS-2528 Project: Mesos Issue Type: Improvement Reporter: Jie Yu This serves two purposes: 1) Allows us to enter the network namespace using container ID (instead of pid): ip netns exec ContainerID [commands] [args]. 2) Allows us to get container ID for orphan containers during recovery. This will be helpful for solving MESOS-2367. The challenge here is to solve it in a backward compatible way. I propose to create symlinks under /var/run/netns. For example: /var/run/netns/containerid -> /var/run/netns/12345 (12345 is the pid) The old code will only remove the bind mounts and leave the symlinks, which I think is fine since containerid is globally unique (uuid). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
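A small sketch of the proposed symlinking, assuming the usual /var/run/netns layout; error handling is simplified and the exact integration point in the port mapping isolator is not shown.

{code}
#include <sys/types.h>
#include <unistd.h>

#include <string>

// After bind-mounting the container's network namespace handle to
// /var/run/netns/<pid>, also create /var/run/netns/<containerId> -> <pid>
// so that `ip netns exec <containerId> ...` works and orphaned handles can
// be mapped back to a ContainerID during recovery.
bool linkNamespaceHandle(const std::string& containerId, pid_t pid)
{
  const std::string target = "/var/run/netns/" + std::to_string(pid);
  const std::string link = "/var/run/netns/" + containerId;

  return ::symlink(target.c_str(), link.c_str()) == 0;
}
{code}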
[jira] [Created] (MESOS-2525) Missing information in Python interface launchTasks scheduler method
Itamar Ostricher created MESOS-2525: --- Summary: Missing information in Python interface launchTasks scheduler method Key: MESOS-2525 URL: https://issues.apache.org/jira/browse/MESOS-2525 Project: Mesos Issue Type: Documentation Components: python api Affects Versions: 0.21.0 Reporter: Itamar Ostricher The docstring of the launchTasks scheduler method in the Python API should explicitly state that launching multiple tasks onto multiple offers is supported only as long as all offers are from the same slave. See mailing list thread: http://www.mail-archive.com/user@mesos.apache.org/msg02861.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2524) Mesos-containerizer not linked from main documentation page.
Joerg Schad created MESOS-2524: -- Summary: Mesos-containerizer not linked from main documentation page. Key: MESOS-2524 URL: https://issues.apache.org/jira/browse/MESOS-2524 Project: Mesos Issue Type: Documentation Reporter: Joerg Schad Assignee: Joerg Schad Priority: Minor Is there any reason that the mesos-containerizer (http://mesos.apache.org/documentation/latest/mesos-containerizer/) documentation is not linked from the main documentation page (http://mesos.apache.org/documentation/latest/)? Both the docker and external containerizer pages are linked from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2018) Dynamic Reservation
[ https://issues.apache.org/jira/browse/MESOS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2018: Description: This is a feature to provide better support for running stateful services on Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), or MySQL (Local Database). Current resource reservations (henceforth called static reservations) are statically determined by the slave operator at slave start time, and individual frameworks have no authority to reserve resources themselves. Dynamic reservations allow a framework to dynamically reserve offered resources, such that those resources will only be re-offered to the same framework (or other frameworks with the same role). This is especially useful if the framework's task stored some state on the slave, and needs a guaranteed set of resources reserved so that it can re-launch a task on the same slave to recover that state. h3. Planned Stages [MESOS-2489| was: This is a feature to provide better support for running stateful services on Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), or MySQL (Local Database). Current resource reservations (henceforth called static reservations) are statically determined by the slave operator at slave start time, and individual frameworks have no authority to reserve resources themselves. Dynamic reservations allow a framework to dynamically reserve offered resources, such that those resources will only be re-offered to the same framework (or other frameworks with the same role). This is especially useful if the framework's task stored some state on the slave, and needs a guaranteed set of resources reserved so that it can re-launch a task on the same slave to recover that state. Dynamic Reservation --- Key: MESOS-2018 URL: https://issues.apache.org/jira/browse/MESOS-2018 Project: Mesos Issue Type: Epic Components: allocation, framework, master, slave Reporter: Adam B Assignee: Michael Park Labels: mesosphere, offer, persistence, reservations, resource, stateful, storage This is a feature to provide better support for running stateful services on Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), or MySQL (Local Database). Current resource reservations (henceforth called static reservations) are statically determined by the slave operator at slave start time, and individual frameworks have no authority to reserve resources themselves. Dynamic reservations allow a framework to dynamically reserve offered resources, such that those resources will only be re-offered to the same framework (or other frameworks with the same role). This is especially useful if the framework's task stored some state on the slave, and needs a guaranteed set of resources reserved so that it can re-launch a task on the same slave to recover that state. h3. Planned Stages [MESOS-2489| -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2525) Missing information in Python interface launchTasks scheduler method
[ https://issues.apache.org/jira/browse/MESOS-2525?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371434#comment-14371434 ] Itamar Ostricher commented on MESOS-2525: - I submitted a [review request|https://reviews.apache.org/r/32306/] Missing information in Python interface launchTasks scheduler method Key: MESOS-2525 URL: https://issues.apache.org/jira/browse/MESOS-2525 Project: Mesos Issue Type: Documentation Components: python api Affects Versions: 0.21.0 Reporter: Itamar Ostricher Labels: documentation, newbie, patch Original Estimate: 1m Remaining Estimate: 1m The docstring of the launchTasks scheduler method in the Python API should explicitly state that launching multiple tasks onto multiple offers is supported only as long as all offers are from the same slave. See mailing list thread: http://www.mail-archive.com/user@mesos.apache.org/msg02861.html -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2070) Implement simple slave recovery behavior for fetcher cache
[ https://issues.apache.org/jira/browse/MESOS-2070?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2070: -- Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Implement simple slave recovery behavior for fetcher cache -- Key: MESOS-2070 URL: https://issues.apache.org/jira/browse/MESOS-2070 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Labels: newbie Original Estimate: 6h Remaining Estimate: 6h Clean the fetcher cache completely upon slave restart/recovery. This implements correct, albeit not ideal behavior. More efficient schemes that restore knowledge about cached files or even resume downloads can be added later. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2425) TODO comment in mesos.proto is already implemented
[ https://issues.apache.org/jira/browse/MESOS-2425?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2425: -- Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 5 - 3/20) TODO comment in mesos.proto is already implemented -- Key: MESOS-2425 URL: https://issues.apache.org/jira/browse/MESOS-2425 Project: Mesos Issue Type: Bug Components: general Affects Versions: 0.20.1 Reporter: Aaron Bell Assignee: Aaron Bell Priority: Minor Labels: mesosphere Attachments: mesos-2425-1.diff These lines are redundant in mesos.proto, since CommandInfo is now implemented: https://github.com/apache/mesos/blob/master/include/mesos/mesos.proto#L169-L174 I'm creating a patch with edits on comment lines only. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2226) HookTest.VerifySlaveLaunchExecutorHook is flaky
[ https://issues.apache.org/jira/browse/MESOS-2226?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2226: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) HookTest.VerifySlaveLaunchExecutorHook is flaky --- Key: MESOS-2226 URL: https://issues.apache.org/jira/browse/MESOS-2226 Project: Mesos Issue Type: Bug Components: test Affects Versions: 0.22.0 Reporter: Vinod Kone Assignee: Kapil Arya Labels: flaky-test Observed this on internal CI {code} [ RUN ] HookTest.VerifySlaveLaunchExecutorHook Using temporary directory '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME' I0114 18:51:34.659353 4720 leveldb.cpp:176] Opened db in 1.255951ms I0114 18:51:34.662112 4720 leveldb.cpp:183] Compacted db in 596090ns I0114 18:51:34.662364 4720 leveldb.cpp:198] Created db iterator in 177877ns I0114 18:51:34.662719 4720 leveldb.cpp:204] Seeked to beginning of db in 19709ns I0114 18:51:34.663010 4720 leveldb.cpp:273] Iterated through 0 keys in the db in 18208ns I0114 18:51:34.663312 4720 replica.cpp:744] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I0114 18:51:34.664266 4735 recover.cpp:449] Starting replica recovery I0114 18:51:34.664908 4735 recover.cpp:475] Replica is in EMPTY status I0114 18:51:34.667842 4734 replica.cpp:641] Replica in EMPTY status received a broadcasted recover request I0114 18:51:34.669117 4735 recover.cpp:195] Received a recover response from a replica in EMPTY status I0114 18:51:34.677913 4735 recover.cpp:566] Updating replica status to STARTING I0114 18:51:34.683157 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 137939ns I0114 18:51:34.683507 4735 replica.cpp:323] Persisted replica status to STARTING I0114 18:51:34.684013 4735 recover.cpp:475] Replica is in STARTING status I0114 18:51:34.685554 4738 replica.cpp:641] Replica in STARTING status received a broadcasted recover request I0114 18:51:34.696512 4736 recover.cpp:195] Received a recover response from a replica in STARTING status I0114 18:51:34.700552 4735 recover.cpp:566] Updating replica status to VOTING I0114 18:51:34.701128 4735 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 115624ns I0114 18:51:34.701478 4735 replica.cpp:323] Persisted replica status to VOTING I0114 18:51:34.701817 4735 recover.cpp:580] Successfully joined the Paxos group I0114 18:51:34.702569 4735 recover.cpp:464] Recover process terminated I0114 18:51:34.716439 4736 master.cpp:262] Master 20150114-185134-2272962752-57018-4720 (fedora-19) started on 192.168.122.135:57018 I0114 18:51:34.716913 4736 master.cpp:308] Master only allowing authenticated frameworks to register I0114 18:51:34.717136 4736 master.cpp:313] Master only allowing authenticated slaves to register I0114 18:51:34.717488 4736 credentials.hpp:36] Loading credentials for authentication from '/tmp/HookTest_VerifySlaveLaunchExecutorHook_GjBgME/credentials' I0114 18:51:34.718077 4736 master.cpp:357] Authorization enabled I0114 18:51:34.719238 4738 whitelist_watcher.cpp:65] No whitelist given I0114 18:51:34.719755 4737 hierarchical_allocator_process.hpp:285] Initialized hierarchical allocator process I0114 18:51:34.722584 4736 master.cpp:1219] The newly elected leader is master@192.168.122.135:57018 with id 
20150114-185134-2272962752-57018-4720 I0114 18:51:34.722865 4736 master.cpp:1232] Elected as the leading master! I0114 18:51:34.723310 4736 master.cpp:1050] Recovering from registrar I0114 18:51:34.723760 4734 registrar.cpp:313] Recovering registrar I0114 18:51:34.725229 4740 log.cpp:660] Attempting to start the writer I0114 18:51:34.727893 4739 replica.cpp:477] Replica received implicit promise request with proposal 1 I0114 18:51:34.728425 4739 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 114781ns I0114 18:51:34.728662 4739 replica.cpp:345] Persisted promised to 1 I0114 18:51:34.731271 4741 coordinator.cpp:230] Coordinator attemping to fill missing position I0114 18:51:34.733223 4734 replica.cpp:378] Replica received explicit promise request for position 0 with proposal 2 I0114 18:51:34.734076 4734 leveldb.cpp:343] Persisting action (8 bytes) to leveldb took 87441ns I0114 18:51:34.734441 4734 replica.cpp:679] Persisted action at 0 I0114 18:51:34.740272 4739 replica.cpp:511] Replica received write request for position 0 I0114 18:51:34.740910 4739 leveldb.cpp:438] Reading position
[jira] [Updated] (MESOS-2373) DRFSorter needs to distinguish resources from different slaves.
[ https://issues.apache.org/jira/browse/MESOS-2373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2373: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) DRFSorter needs to distinguish resources from different slaves. --- Key: MESOS-2373 URL: https://issues.apache.org/jira/browse/MESOS-2373 Project: Mesos Issue Type: Bug Components: allocation Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Currently the {{DRFSorter}} aggregates total and allocated resources across multiple slaves, which only works for scalar resources. We need to distinguish resources from different slaves. Suppose we have 2 slaves and 1 framework. The framework is allocated all resources from both slaves.
{code}
Resources slaveResources =
  Resources::parse("cpus:2;mem:512;ports:[31000-32000]").get();

DRFSorter sorter;

sorter.add(slaveResources); // Add slave1 resources
sorter.add(slaveResources); // Add slave2 resources

// Total resources in sorter at this point is
// cpus(*):4; mem(*):1024; ports(*):[31000-32000].
// The scalar resources get aggregated correctly but ports do not.

sorter.add(F);

// The 2 calls to allocated only work because we simply do:
//   allocation[name] += resources;
// without checking that the 'resources' is available in the total.
sorter.allocated(F, slaveResources);
sorter.allocated(F, slaveResources);

// At this point, sorter.allocation(F) is:
// cpus(*):4; mem(*):1024; ports(*):[31000-32000].
{code}
To provide some context, this issue came up while trying to reserve all unreserved resources from every offer.
{code}
for (const Offer& offer : offers) {
  Resources unreserved = offer.resources().unreserved();
  Resources reserved = unreserved.flatten(role, Resource::FRAMEWORK);

  Offer::Operation reserve;
  reserve.set_type(Offer::Operation::RESERVE);
  reserve.mutable_reserve()->mutable_resources()->CopyFrom(reserved);

  driver->acceptOffers({offer.id()}, {reserve});
}
{code}
Suppose the slave resources are the same as above:
{quote} Slave1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} Slave2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} {quote}
Initial (incorrect) total resources in the DRFSorter is:
{quote} {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}} {quote}
We receive 2 offers, 1 from each slave:
{quote} Offer1: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} Offer2: {{cpus(\*):2; mem(\*):512; ports(\*):\[31000-32000\]}} {quote}
At this point, the resources allocated for the framework are:
{quote} {{cpus(\*):4; mem(\*):1024; ports(\*):\[31000-32000\]}} {quote}
After the first {{RESERVE}} operation with Offer1, the allocated resources for the framework become:
{quote} {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; ports(role):\[31000-32000\]}} {quote}
During the second {{RESERVE}} operation with Offer2:
{code:title=HierarchicalAllocatorProcess::updateAllocation}
// ...
FrameworkSorter* frameworkSorter =
  frameworkSorters[frameworks[frameworkId].role];

Resources allocation = frameworkSorter->allocation(frameworkId.value());

// Update the allocated resources.
Try<Resources> updatedAllocation = allocation.apply(operations);
CHECK_SOME(updatedAllocation);
// ...
{code}
{{allocation}} in the above code is:
{quote} {{cpus(\*):2; mem(\*):512; cpus(role):2; mem(role):512; ports(role):\[31000-32000\]}} {quote}
We try to {{apply}} a {{RESERVE}} operation and we fail to find {{ports(\*):\[31000-32000\]}}, which leads to the {{CHECK}} failure at {{CHECK_SOME(updatedAllocation);}} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2119) Add Socket tests
[ https://issues.apache.org/jira/browse/MESOS-2119?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2119: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Add Socket tests Key: MESOS-2119 URL: https://issues.apache.org/jira/browse/MESOS-2119 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere Add more Socket specific tests to get coverage while doing libev to libevent (w and wo SSL) move -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2069) Basic fetcher cache functionality
[ https://issues.apache.org/jira/browse/MESOS-2069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2069: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Basic fetcher cache functionality - Key: MESOS-2069 URL: https://issues.apache.org/jira/browse/MESOS-2069 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Labels: fetcher, slave Original Estimate: 48h Remaining Estimate: 48h Add a flag to CommandInfo URI protobufs that indicates that files downloaded by the fetcher shall be cached in a repository. To be followed by MESOS-2057 for concurrency control. Also see MESOS-336 for the overall goals for the fetcher cache. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
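A sketch of how a framework might opt a URI into the cache once such a flag exists. The cache field is exactly what this ticket proposes, so set_cache() is an assumed, not yet existing, accessor; set_extract() is an existing one.

{code}
#include <mesos/mesos.pb.h>

#include <string>

mesos::CommandInfo fetchCached(const std::string& url)
{
  mesos::CommandInfo command;

  mesos::CommandInfo::URI* uri = command.add_uris();
  uri->set_value(url);
  uri->set_extract(true);  // Existing field: unpack archives after download.
  uri->set_cache(true);    // Proposed field: reuse a previously fetched copy.

  return command;
}
{code}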
[jira] [Updated] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1806: -- Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Substituting etcd or ReplicatedLog for Zookeeper Key: MESOS-1806 URL: https://issues.apache.org/jira/browse/MESOS-1806 Project: Mesos Issue Type: Task Reporter: Ed Ropple Assignee: Cody Maloney Priority: Minor adam_mesos eropple: Could you also file a new JIRA for Mesos to drop ZK in favor of etcd or ReplicatedLog? Would love to get some momentum going on that one. -- Consider it filed. =) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2248) 0.22.0 release
[ https://issues.apache.org/jira/browse/MESOS-2248?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2248: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) 0.22.0 release -- Key: MESOS-2248 URL: https://issues.apache.org/jira/browse/MESOS-2248 Project: Mesos Issue Type: Epic Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen Mesos release 0.22.0 will include the following major feature(s): - Module Hooks (MESOS-2060) - Disk quota isolation in Mesos containerizer (MESOS-1587 and MESOS-1588) Minor features and fixes: - Task labels (MESOS-2120) - Service discovery info for tasks and executors (MESOS-2208) - Docker containerizer able to recover when running in a container (MESOS-2115) - Containerizer fixes (...) - Various bug fixes (...) Possible major features: - Container level network isolation (MESOS-1585) - Dynamic Reservations (MESOS-2018) This ticket will be used to track blockers to this release. For reference (per Jan 22nd) this has gone into Mesos since 0.21.1: https://gist.github.com/nqn/76aeb41a555625659ed8 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2215: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks. --- Key: MESOS-2215 URL: https://issues.apache.org/jira/browse/MESOS-2215 Project: Mesos Issue Type: Bug Components: docker Affects Versions: 0.21.0 Reporter: Steve Niemitz Assignee: Timothy Chen Once the slave restarts and recovers the task, I see this error in the log for all tasks that were recovered every second or so. Note, these were NOT docker tasks: W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21 However, the tasks themselves are still healthy and running. The slave was launched with --containerizers=mesos,docker - More info: it looks like the docker containerizer is a little too ambitious about recovering containers; again, this was not a docker task: I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' of framework 20150109-161713-715350282-5050-290797- Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it was the one that launched the task before recovering it. Perhaps this needs to be written into the checkpoint somewhere? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
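One way to express the idea raised at the end of the report (checkpointing which containerizer launched a container so recovery can be filtered) is sketched below. This is purely illustrative and not the actual Mesos checkpoint format or fix.

{code}
#include <fstream>
#include <string>

// Record which containerizer ("docker" or "mesos") launched this container
// in its run directory when the executor is started...
void checkpointContainerizer(const std::string& runDir,
                             const std::string& containerizer)
{
  std::ofstream out(runDir + "/containerizer.type");
  out << containerizer;
}

// ...and during recover(), skip containers owned by another containerizer
// instead of probing them with `docker inspect`.
bool ownedBy(const std::string& runDir, const std::string& containerizer)
{
  std::ifstream in(runDir + "/containerizer.type");
  std::string type;
  in >> type;
  return in && type == containerizer;
}
{code}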
[jira] [Updated] (MESOS-2018) Dynamic Reservation
[ https://issues.apache.org/jira/browse/MESOS-2018?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2018: Description: This is a feature to provide better support for running stateful services on Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), or MySQL (Local Database). Current resource reservations (henceforth called static reservations) are statically determined by the slave operator at slave start time, and individual frameworks have no authority to reserve resources themselves. Dynamic reservations allow a framework to dynamically reserve offered resources, such that those resources will only be re-offered to the same framework (or other frameworks with the same role). This is especially useful if the framework's task stored some state on the slave, and needs a guaranteed set of resources reserved so that it can re-launch a task on the same slave to recover that state. h3. Planned Stages 1. MESOS-2489: Enable a framework to perform reservation operations. The goal of this stage is to allow the framework to send back a Reserve/Unreserve operation which gets validated by the master and updates the allocator resources. The allocator's {{allocate}} logic is unchanged and therefore the resources get offered back to the framework's role rather than the specific framework. In the next stage, we'll teach the allocator to distinguish between role and framework reservations which will result in the resources being re-offered to the specific framework. 2. MESOS-2490: Enable the allocator to distinguish between role and framework reservations. The goal of this stage is to teach the allocator to offer resources reserved for a framework to only be sent to the particular framework rather than the framework's role. This will involve updating the {{allocate}} function to select only the resources that are unreserved, role-reserved for the framework's role and framework-reserved for the framework. 3. MESOS-2491: Persist the reservation state on the slave. The goal of this stage is to persist the reservation state on the slave. Currently the master knows to store the persistent volumes in the {{checkpointedResources}} data structure which gets sent to individual slaves to be checkpointed. We will update the master such that dynamically reserved resources are stored in the {{checkpointedResources}} as well. This stage also involves subtasks such as updating the slave re(register) logic to support slave re-starts. was: This is a feature to provide better support for running stateful services on Mesos such as HDFS (Distributed Filesystem), Cassandra (Distributed Database), or MySQL (Local Database). Current resource reservations (henceforth called static reservations) are statically determined by the slave operator at slave start time, and individual frameworks have no authority to reserve resources themselves. Dynamic reservations allow a framework to dynamically reserve offered resources, such that those resources will only be re-offered to the same framework (or other frameworks with the same role). This is especially useful if the framework's task stored some state on the slave, and needs a guaranteed set of resources reserved so that it can re-launch a task on the same slave to recover that state. h3. Planned Stages 1. MESOS-2489: Enable a framework to perform reservation operations. 
The goal of this stage is to allow the framework to send back a Reserve/Unreserve operation which gets validated by the master and updates the allocator resources. The allocator's {{allocate}} logic is unchanged and therefore the resources get offered back to the framework's role rather than the specific framework. In the next stage, we'll teach the allocator to distinguish between role and framework reservations which will result in the resources being re-offered to the specific framework. 2. MESOS-2490: Enable the allocator to distinguish between role and framework reservations. The goal of this stage of is to teach the allocator to offer resources reserved for a framework to only be sent to the particular framework rather than the framework's role. This will involve updating the {{allocate}} function to select only the resources that are unreserved, role-reserved for the framework's role and framework-reserved for the framework. 3. Dynamic Reservation --- Key: MESOS-2018 URL: https://issues.apache.org/jira/browse/MESOS-2018 Project: Mesos Issue Type: Epic Components: allocation, framework, master, slave Reporter: Adam B Assignee: Michael Park Labels: mesosphere, offer, persistence, reservations, resource, stateful, storage This is a feature to provide better support for running stateful services on Mesos such as HDFS
[jira] [Updated] (MESOS-1806) Substituting etcd or ReplicatedLog for Zookeeper
[ https://issues.apache.org/jira/browse/MESOS-1806?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1806: -- Sprint: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3) Substituting etcd or ReplicatedLog for Zookeeper Key: MESOS-1806 URL: https://issues.apache.org/jira/browse/MESOS-1806 Project: Mesos Issue Type: Task Reporter: Ed Ropple Assignee: Cody Maloney Priority: Minor adam_mesos eropple: Could you also file a new JIRA for Mesos to drop ZK in favor of etcd or ReplicatedLog? Would love to get some momentum going on that one. -- Consider it filed. =) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2155) Make docker containerizer killing orphan containers optional
[ https://issues.apache.org/jira/browse/MESOS-2155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2155: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3) Make docker containerizer killing orphan containers optional Key: MESOS-2155 URL: https://issues.apache.org/jira/browse/MESOS-2155 Project: Mesos Issue Type: Improvement Components: docker Reporter: Timothy Chen Assignee: Timothy Chen Currently the docker containerizer on recover will kill containers that are not recognized by the containerizer. We want to make this behavior optional as there are certain situations we want to let the docker containers still continue to run. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
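A sketch of what making the behavior optional might look like, assuming a boolean containerizer flag; the flag name (docker_kill_orphans) and wiring are assumptions for illustration.

{code}
#include <string>
#include <vector>

struct DockerContainerizerFlags
{
  bool docker_kill_orphans = true;   // Current behavior, kept as the default.
};

void recoverOrphans(const DockerContainerizerFlags& flags,
                    const std::vector<std::string>& orphanContainerIds,
                    void (*destroy)(const std::string&))
{
  if (!flags.docker_kill_orphans) {
    // Operator asked us to leave unrecognized containers running.
    return;
  }

  for (const std::string& id : orphanContainerIds) {
    destroy(id);
  }
}
{code}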
[jira] [Updated] (MESOS-1980) Benchmark RPC/s of linked Libprocess
[ https://issues.apache.org/jira/browse/MESOS-1980?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1980: -- Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23) Benchmark RPC/s of linked Libprocess Key: MESOS-1980 URL: https://issues.apache.org/jira/browse/MESOS-1980 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Libprocess has some performance bottlenecks. Implement a benchmark where we can see regressions / improvements regarding RPCs performed per second. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2016) docker_name_prefix is too generic
[ https://issues.apache.org/jira/browse/MESOS-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2016: -- Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20 (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3) docker_name_prefix is too generic - Key: MESOS-2016 URL: https://issues.apache.org/jira/browse/MESOS-2016 Project: Mesos Issue Type: Bug Reporter: Jay Buffington Assignee: Timothy Chen From docker.hpp and docker.cpp:
{code}
// Prefix used to name Docker containers in order to distinguish those
// created by Mesos from those created manually.
extern std::string DOCKER_NAME_PREFIX;

// TODO(benh): At some point to run multiple slaves we'll need to make
// the Docker container name creation include the slave ID.
string DOCKER_NAME_PREFIX = "mesos-";
{code}
This name is too generic. A common pattern in docker land is to run everything in a container and use volume mounts to share sockets and do RPC between containers. CoreOS has popularized this technique. Inevitably, what people do is start a container named mesos-slave, which runs the docker containerizer recovery code, which removes all containers that start with mesos-. And then they ask: huh, why did my mesos-slave docker container die? I don't see any error messages... Ideally, we should do what Ben suggested and add the slave id to the name prefix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
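A sketch of the naming scheme suggested above, i.e. including the slave ID in the prefix so that user containers that merely start with mesos- are not mistaken for Mesos-launched ones; the exact separator and format here are assumptions.

{code}
#include <string>

const std::string DOCKER_NAME_PREFIX = "mesos-";

// e.g. "mesos-20150312-145235-3982541578-5050-1421-S0.adb71dc4-..."
std::string containerName(const std::string& slaveId,
                          const std::string& containerId)
{
  return DOCKER_NAME_PREFIX + slaveId + "." + containerId;
}
{code}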
[jira] [Updated] (MESOS-905) Remove Framework.id in favor of FrameworkInfo.id
[ https://issues.apache.org/jira/browse/MESOS-905?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-905: - Sprint: Mesosphere Q1 Sprint 6 - 4/3 Remove Framework.id in favor of FrameworkInfo.id Key: MESOS-905 URL: https://issues.apache.org/jira/browse/MESOS-905 Project: Mesos Issue Type: Improvement Components: framework Reporter: Adam B Assignee: Adam B Priority: Minor Labels: easyfix Framework.id currently holds the correct FrameworkId, but Framework also contains a FrameworkInfo, and the FrameworkInfo.id is not necessarily set. I propose that we eliminate the Framework.id member variable and replace it with a Framework.id() accessor that references Framework.FrameworkInfo.id and ensure that it is correctly set. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
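A minimal sketch of the proposed accessor, using stand-in types rather than the actual Mesos/protobuf classes:
{code}
#include <string>

// Stand-ins for the protobuf-generated FrameworkID/FrameworkInfo types.
struct FrameworkID { std::string value; };

struct FrameworkInfo
{
  FrameworkID id;   // must be correctly set when the framework (re-)registers
  std::string name;
};

// Sketch of the proposal: drop the separate Framework::id member and expose
// an accessor that reads the id embedded in FrameworkInfo instead.
struct Framework
{
  explicit Framework(const FrameworkInfo& _info) : info(_info) {}

  const FrameworkID& id() const { return info.id; }

  FrameworkInfo info;
};
{code}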
[jira] [Commented] (MESOS-540) Executor health checking.
[ https://issues.apache.org/jira/browse/MESOS-540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372532#comment-14372532 ] Rohit Upadhyaya commented on MESOS-540: --- Is this being worked on by someone else? I see a 'health-check' folder in src/ having some sample code added about a month back. Executor health checking. - Key: MESOS-540 URL: https://issues.apache.org/jira/browse/MESOS-540 Project: Mesos Issue Type: Improvement Reporter: Benjamin Mahler Labels: newbie, twitter We currently do not health check running executors. At Twitter, this has led to out-of-band health checking of executors for an internal framework. For the Storm framework, this has led to out-of-band health checking via ZooKeeper. Health checking would allow Storm to use finer grained executors for better isolation. This also helps the Hadoop and Jenkins frameworks as well should health checking be desired. As for implementation, I would propose adding a call on the Executor interface:
{code}
/**
 * Invoked by the ExecutorDriver to determine the health of the executor.
 * When this function returns, the Executor is considered healthy.
 */
void heartbeat(ExecutorDriver* driver) = 0;
{code}
The driver can then heartbeat periodically and kill when the Executor is not responding to heartbeats. The driver should also detect the executor deadlocking on any of the other callbacks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
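A self-contained sketch of the driver-side idea using plain C++ threads (illustrative only; a real implementation would live inside the ExecutorDriver and exchange libprocess messages): heartbeat periodically and treat a missed acknowledgement as unhealthy.
{code}
#include <atomic>
#include <chrono>
#include <functional>
#include <thread>

// Illustrative only: ping the executor every 'interval'; if it has not
// acknowledged by the time the next ping is due, treat it as unhealthy.
class HeartbeatMonitor
{
public:
  HeartbeatMonitor(std::function<void()> sendHeartbeat,
                   std::function<void()> onUnhealthy,
                   std::chrono::milliseconds interval)
    : sendHeartbeat_(sendHeartbeat),
      onUnhealthy_(onUnhealthy),
      interval_(interval) {}

  // Called by the executor (from any thread) in response to a heartbeat.
  void ack() { acked_ = true; }

  // Runs a bounded number of heartbeat rounds.
  void run(int rounds)
  {
    for (int i = 0; i < rounds; i++) {
      acked_ = false;
      sendHeartbeat_();
      std::this_thread::sleep_for(interval_);
      if (!acked_) {
        onUnhealthy_();  // e.g. kill the executor and report the tasks lost
        return;
      }
    }
  }

private:
  std::function<void()> sendHeartbeat_;
  std::function<void()> onUnhealthy_;
  std::chrono::milliseconds interval_;
  std::atomic<bool> acked_{false};
};
{code}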
[jira] [Updated] (MESOS-2422) Use fq_codel qdisc for egress network traffic isolation
[ https://issues.apache.org/jira/browse/MESOS-2422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinod Kone updated MESOS-2422: -- Sprint: Twitter Mesos Q1 Sprint 4 (was: Twitter Mesos Q1 Sprint 4, Twitter Mesos Q1 Sprint 5) Use fq_codel qdisc for egress network traffic isolation --- Key: MESOS-2422 URL: https://issues.apache.org/jira/browse/MESOS-2422 Project: Mesos Issue Type: Task Reporter: Cong Wang Assignee: Cong Wang Labels: twitter -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2057) Concurrency control for fetcher cache
[ https://issues.apache.org/jira/browse/MESOS-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2057: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Concurrency control for fetcher cache - Key: MESOS-2057 URL: https://issues.apache.org/jira/browse/MESOS-2057 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 96h Remaining Estimate: 96h MESOS-2069 added a URI flag to CommandInfo messages that indicates caching, so that files downloaded by the fetcher are kept in a repository. Now ensure that when a URI is cached, it is only ever downloaded once for the same user on the same slave, as long as the slave keeps running. This even holds if multiple tasks request the same URI concurrently. If multiple requests for the same URI occur, perform only one of them and reuse the result. Make concurrent requests for the same URI wait for the one download. Different URIs from different CommandInfos can be downloaded concurrently. No cache eviction, cleanup or failover will be handled for now. Additional tickets will be filed for these enhancements. (So don't use this feature in production until the whole epic is complete.) Note that implementing this does not suffice for production use. This ticket contains the main part of the fetcher logic, though. See the epic MESOS-336 for the rest of the features that lead to a fully functional fetcher cache. The proposed general approach is to keep all bookkeeping about what is in which stage of being fetched and where it resides in the slave's MesosContainerizerProcess, so that all concurrent access is disambiguated and controlled by an actor (aka libprocess process). Depends on MESOS-2056 and MESOS-2069. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
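A self-contained sketch of the "perform only one download per URI and make concurrent requests wait for it" idea, using std::shared_future; the real implementation would use libprocess futures inside MesosContainerizerProcess, and the cache key would also include the user:
{code}
#include <functional>
#include <future>
#include <map>
#include <mutex>
#include <string>

// Illustrative only: the first request for a URI starts the download; any
// concurrent request for the same URI gets the same shared_future and simply
// waits on it. Different URIs download independently.
class FetchCache
{
public:
  // 'download' maps a URI to a local path inside the cache directory.
  explicit FetchCache(std::function<std::string(const std::string&)> download)
    : download_(download) {}

  std::shared_future<std::string> fetch(const std::string& uri)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    auto it = fetching_.find(uri);
    if (it != fetching_.end()) {
      return it->second;  // reuse the in-flight (or completed) download
    }
    std::shared_future<std::string> future =
        std::async(std::launch::async, download_, uri).share();
    fetching_.emplace(uri, future);
    return future;
  }

private:
  std::function<std::string(const std::string&)> download_;
  std::mutex mutex_;
  std::map<std::string, std::shared_future<std::string>> fetching_;
};
{code}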
[jira] [Updated] (MESOS-2050) InMemoryAuxProp plugin used by Authenticators results in SEGFAULT
[ https://issues.apache.org/jira/browse/MESOS-2050?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2050: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) InMemoryAuxProp plugin used by Authenticators results in SEGFAULT - Key: MESOS-2050 URL: https://issues.apache.org/jira/browse/MESOS-2050 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Reporter: Vinod Kone Assignee: Till Toenshoff Observed this on ASF CI: Basically, as part of the recent Auth refactor for modules, the loading of secrets is being done once per Authenticator Process instead of once in the Master. Since, InMemoryAuxProp plugin manipulates static variables (e.g, 'properties') it results in SEGFAULT when one Authenticator (e.g., for slave) does load() while another Authenticator (e.g., for framework) does lookup(), as both these methods manipulate static 'properties'. {code} [ RUN ] MasterTest.LaunchDuplicateOfferTest Using temporary directory '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp' I1104 03:37:55.523553 28363 leveldb.cpp:176] Opened db in 2.270387ms I1104 03:37:55.524250 28363 leveldb.cpp:183] Compacted db in 662527ns I1104 03:37:55.524276 28363 leveldb.cpp:198] Created db iterator in 4964ns I1104 03:37:55.524284 28363 leveldb.cpp:204] Seeked to beginning of db in 702ns I1104 03:37:55.524291 28363 leveldb.cpp:273] Iterated through 0 keys in the db in 450ns I1104 03:37:55.524333 28363 replica.cpp:741] Replica recovered with log positions 0 - 0 with 1 holes and 0 unlearned I1104 03:37:55.524852 28384 recover.cpp:437] Starting replica recovery I1104 03:37:55.525188 28384 recover.cpp:463] Replica is in EMPTY status I1104 03:37:55.526577 28378 replica.cpp:638] Replica in EMPTY status received a broadcasted recover request I1104 03:37:55.527135 28378 master.cpp:318] Master 20141104-033755-3176252227-49988-28363 (proserpina.apache.org) started on 67.195.81.189:49988 I1104 03:37:55.527180 28378 master.cpp:364] Master only allowing authenticated frameworks to register I1104 03:37:55.527191 28378 master.cpp:369] Master only allowing authenticated slaves to register I1104 03:37:55.527217 28378 credentials.hpp:36] Loading credentials for authentication from '/tmp/MasterTest_LaunchDuplicateOfferTest_XEBbvp/credentials' I1104 03:37:55.527451 28378 master.cpp:408] Authorization enabled I1104 03:37:55.528081 28384 master.cpp:126] No whitelist given. Advertising offers for all slaves I1104 03:37:55.528548 28383 recover.cpp:188] Received a recover response from a replica in EMPTY status I1104 03:37:55.528645 28388 hierarchical_allocator_process.hpp:299] Initializing hierarchical allocator process with master : master@67.195.81.189:49988 I1104 03:37:55.529233 28388 master.cpp:1258] The newly elected leader is master@67.195.81.189:49988 with id 20141104-033755-3176252227-49988-28363 I1104 03:37:55.529266 28388 master.cpp:1271] Elected as the leading master! 
I1104 03:37:55.529289 28388 master.cpp:1089] Recovering from registrar I1104 03:37:55.529311 28385 recover.cpp:554] Updating replica status to STARTING I1104 03:37:55.529500 28384 registrar.cpp:313] Recovering registrar I1104 03:37:55.530037 28383 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 497965ns I1104 03:37:55.530083 28383 replica.cpp:320] Persisted replica status to STARTING I1104 03:37:55.530335 28387 recover.cpp:463] Replica is in STARTING status I1104 03:37:55.531343 28381 replica.cpp:638] Replica in STARTING status received a broadcasted recover request I1104 03:37:55.531739 28384 recover.cpp:188] Received a recover response from a replica in STARTING status I1104 03:37:55.532168 28379 recover.cpp:554] Updating replica status to VOTING I1104 03:37:55.532572 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 293974ns I1104 03:37:55.532594 28381 replica.cpp:320] Persisted replica status to VOTING I1104 03:37:55.532790 28390 recover.cpp:568] Successfully joined the Paxos group I1104 03:37:55.533107 28390 recover.cpp:452] Recover process terminated I1104 03:37:55.533604 28382 log.cpp:656] Attempting to start the writer I1104 03:37:55.534840 28381 replica.cpp:474] Replica received implicit promise request with proposal 1 I1104 03:37:55.535188 28381 leveldb.cpp:306] Persisting metadata (8 bytes) to leveldb took 321021ns I1104 03:37:55.535212 28381 replica.cpp:342] Persisted promised to 1 I1104 03:37:55.535893 28378 coordinator.cpp:230] Coordinator attemping to fill missing position I1104
[jira] [Updated] (MESOS-2110) Configurable Ping Timeouts
[ https://issues.apache.org/jira/browse/MESOS-2110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2110: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Configurable Ping Timeouts -- Key: MESOS-2110 URL: https://issues.apache.org/jira/browse/MESOS-2110 Project: Mesos Issue Type: Improvement Components: master, slave Reporter: Adam B Assignee: Adam B Labels: master, network, slave, timeout After a series of ping-failures, the master considers the slave lost and calls shutdownSlave, requiring such a slave that reconnects to kill its tasks and re-register as a new slaveId. On the other side, after a similar timeout, the slave will consider the master lost and try to detect a new master. These timeouts are currently hardcoded constants (5 * 15s), which may not be well-suited for all scenarios. - Some clusters may tolerate a longer slave process restart period, and wouldn't want tasks to be killed upon reconnect. - Some clusters may have higher-latency networks (e.g. cross-datacenter, or for volunteer computing efforts), and would like to tolerate longer periods without communication. We should provide flags/mechanisms on the master to control its tolerance for non-communicative slaves, and (less importantly?) on the slave to tolerate missing masters. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
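A small sketch of what making this configurable amounts to: the hardcoded 5 * 15s becomes the product of two tunables (the flag names below are illustrative, not necessarily the eventual Mesos flags):
{code}
#include <chrono>
#include <iostream>

int main()
{
  // Today: effectively hardcoded to 5 pings of 15 seconds each.
  // Proposed: both values come from flags, e.g. --slave_ping_timeout and
  // --max_slave_ping_timeouts (names illustrative).
  std::chrono::seconds slavePingTimeout(15);
  int maxSlavePingTimeouts = 5;

  auto slaveRemovalTimeout = maxSlavePingTimeouts * slavePingTimeout;
  std::cout << "slave considered lost after "
            << slaveRemovalTimeout.count() << "s without a response\n";
  return 0;
}
{code}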
[jira] [Updated] (MESOS-2160) Add support for allocator modules
[ https://issues.apache.org/jira/browse/MESOS-2160?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2160: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Add support for allocator modules - Key: MESOS-2160 URL: https://issues.apache.org/jira/browse/MESOS-2160 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Alexander Rukletsov Labels: mesosphere Currently Mesos supports only the DRF allocator, and changing it requires modifying Mesos source code, which sets a high entry barrier. Allocator modules provide an easy way to tweak the resource allocation policy. This will enable swapping allocation policies without having to edit Mesos source code. Custom allocators can be written by anyone and do not need to be distributed together with Mesos. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
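A rough sketch of what a pluggable allocator boils down to; the names and signatures are illustrative stand-ins, not the actual Mesos allocator module API: the master programs against an abstract interface and a module supplies a factory for a concrete policy.
{code}
#include <map>
#include <string>
#include <vector>

// Illustrative stand-ins, not the real Mesos types.
struct FrameworkID { std::string value; };
struct Resources { double cpus = 0; double mem = 0; };

class Allocator
{
public:
  virtual ~Allocator() {}
  virtual void addFramework(const FrameworkID& id) = 0;
  // Decide which framework(s) are offered the currently free resources.
  virtual std::map<std::string, Resources> allocate(const Resources& free) = 0;
};

// A trivial non-DRF policy: offer everything to the longest-registered
// framework. A module exports a factory like the one below and the master
// loads it by name instead of being hardwired to a single policy.
class FifoAllocator : public Allocator
{
public:
  void addFramework(const FrameworkID& id) override
  {
    order_.push_back(id.value);
  }

  std::map<std::string, Resources> allocate(const Resources& free) override
  {
    std::map<std::string, Resources> offers;
    if (!order_.empty()) {
      offers[order_.front()] = free;
    }
    return offers;
  }

private:
  std::vector<std::string> order_;
};

extern "C" Allocator* createAllocator()
{
  return new FifoAllocator();
}
{code}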
[jira] [Updated] (MESOS-2074) Fetcher cache test fixture
[ https://issues.apache.org/jira/browse/MESOS-2074?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2074: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Fetcher cache test fixture -- Key: MESOS-2074 URL: https://issues.apache.org/jira/browse/MESOS-2074 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 72h Remaining Estimate: 72h To accelerate providing good test coverage for the fetcher cache (MESOS-336), we can provide a framework that canonicalizes creating and running a number of tasks and allows easy parametrization with combinations of the following: - whether to cache or not - whether to make what has been downloaded executable or not - whether to extract from an archive or not - whether to download from a file system, http, or... We can create a simple HTTP server in the test fixture to support the latter. Furthermore, the tests need to be robust wrt. varying numbers of StatusUpdate messages. An accumulating update message sink that reports the final state is needed. All of this has already been programmed in this patch; it just needs to be rebased: https://reviews.apache.org/r/21316/ -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2165) When cyrus sasl MD5 isn't installed configure passes, tests fail without any output
[ https://issues.apache.org/jira/browse/MESOS-2165?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2165: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) When cyrus sasl MD5 isn't installed configure passes, tests fail without any output --- Key: MESOS-2165 URL: https://issues.apache.org/jira/browse/MESOS-2165 Project: Mesos Issue Type: Bug Reporter: Cody Maloney Assignee: Till Toenshoff Labels: mesosphere Sample Dockerfile to make such a host: {code} FROM centos:centos7 RUN yum install -y epel-release gcc python-devel RUN yum install -y python-pip RUN yum install -y rpm-build redhat-rpm-config autoconf make gcc gcc-c++ patch libtool git python-devel ruby-devel java-1.7.0-openjdk-devel zlib-devel libcurl-devel openssl-devel cyrus-sasl-devel rubygems apr-devel apr-util-devel subversion-devel maven libselinux-python {code} Use: 'docker run -i -t imagename /bin/bash' to run the image, get a shell inside where you can 'git clone' mesos and build/run the tests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2072) Fetcher cache eviction
[ https://issues.apache.org/jira/browse/MESOS-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2072: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Fetcher cache eviction -- Key: MESOS-2072 URL: https://issues.apache.org/jira/browse/MESOS-2072 Project: Mesos Issue Type: Improvement Components: fetcher, slave Reporter: Bernd Mathiske Assignee: Bernd Mathiske Original Estimate: 336h Remaining Estimate: 336h Delete files from the fetcher cache so that a given cache size is never exceeded. Succeed in doing so while concurrent downloads are on their way and new requests are pouring in. Idea: measure the size of each download before it begins, make enough room before the download. This means that only download mechanisms that divulge the size before the main download will be supported. AFAWK, those in use so far have this property. The calculation of how much space to free needs to be under concurrency control, accumulating all space needed for competing, incomplete download requests. (The Python script that performs fetcher caching for Aurora does not seem to implement this. See https://gist.github.com/zmanji/f41df77510ef9d00265a, imagine several of these programs running concurrently, each one's _cache_eviction() call succeeding, each perceiving the SAME free space being available.) Ultimately, a conflict resolution strategy is needed if just the downloads underway already exceed the cache capacity. Then, as a fallback, direct download into the work directory will be used for some tasks. TBD how to pick which task gets treated how. At first, only support copying of any downloaded files to the work directory for task execution. This isolates the task life cycle after starting a task from cache eviction considerations. (Later, we can add symbolic links that avoid copying. But then eviction of fetched files used by ongoing tasks must be blocked, which adds complexity. another future extension is MESOS-1667 Extract from URI while downloading into work dir). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
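A sketch of the concurrency-safe space accounting described above: every in-flight download reserves its (known-in-advance) size under a single lock before it starts, so concurrent fetches cannot all observe the same free space (the flaw attributed to the Aurora script). Illustrative only; the actual file deletion and the libprocess actor wrapping are omitted.
{code}
#include <mutex>

// Illustrative only: one lock (or actor) owns the cache accounting. A fetch
// must reserve its size before downloading; eviction and fall-back decisions
// happen while the lock is held, so two concurrent reservations can never
// both claim the same free bytes.
class CacheSpace
{
public:
  explicit CacheSpace(long long capacityBytes)
    : capacity_(capacityBytes), used_(0), reserved_(0) {}

  // 'evictableBytes' is the amount of cached data not referenced by any
  // running task (must be <= used). Returns true if the download may go into
  // the cache; false means fall back to downloading into the work directory.
  bool reserve(long long sizeBytes, long long evictableBytes)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    long long needed = used_ + reserved_ + sizeBytes - capacity_;
    if (needed > evictableBytes) {
      return false;  // even evicting everything evictable would not fit it
    }
    if (needed > 0) {
      evict(needed);  // delete unused cached files (omitted)
    }
    reserved_ += sizeBytes;
    return true;
  }

  void downloadFinished(long long sizeBytes)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    reserved_ -= sizeBytes;
    used_ += sizeBytes;
  }

private:
  void evict(long long bytes) { used_ -= bytes; /* file deletion omitted */ }

  const long long capacity_;
  long long used_;
  long long reserved_;
  std::mutex mutex_;
};
{code}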
[jira] [Updated] (MESOS-2351) Enable label and environment decorators (hooks) to remove label and environment entries
[ https://issues.apache.org/jira/browse/MESOS-2351?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2351: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Enable label and environment decorators (hooks) to remove label and environment entries --- Key: MESOS-2351 URL: https://issues.apache.org/jira/browse/MESOS-2351 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Niklas Quarfot Nielsen We need to change the semantics of decorators to be able to not only add labels and environment variables, but also remove them. The change is fairly small: the hook manager (and its call sites) uses CopyFrom instead of MergeFrom, and hook implementers pass on the labels and environment from the task and executor commands, respectively. In the future, we can tag labels such that only labels belonging to a hook type (across master and slave) can be inspected and changed. For now, the active hooks are selected by the operator and can therefore be trusted. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
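The semantic difference comes from protobuf: MergeFrom appends repeated fields, so a decorator can only ever add entries, while CopyFrom replaces the message wholesale, so an entry the decorator omits is effectively removed. A sketch, assuming the generated classes for the Labels/Label messages:
{code}
#include <mesos/mesos.pb.h>  // assumed: generated protobuf classes for Labels/Label

void example()
{
  mesos::Labels original;
  original.add_labels()->set_key("keep-me");
  original.add_labels()->set_key("drop-me");

  // A decorator hook returns the full set of labels it wants, having copied
  // "keep-me" from the task and deliberately left out "drop-me".
  mesos::Labels decorated;
  decorated.add_labels()->set_key("keep-me");

  mesos::Labels result;

  result.MergeFrom(original);
  result.MergeFrom(decorated);  // old behavior: appends, so "drop-me" survives

  result.Clear();
  result.CopyFrom(decorated);   // proposed behavior: replaces, so "drop-me" is gone
}
{code}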
[jira] [Updated] (MESOS-2500) Doxygen setup for libprocess
[ https://issues.apache.org/jira/browse/MESOS-2500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2500: -- Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 5 - 3/20) Doxygen setup for libprocess Key: MESOS-2500 URL: https://issues.apache.org/jira/browse/MESOS-2500 Project: Mesos Issue Type: Documentation Components: libprocess Reporter: Bernd Mathiske Assignee: Joerg Schad Original Estimate: 48h Remaining Estimate: 48h Goals: - Initial doxygen setup. - Enable interested developers to generate already available doxygen content locally in their workspace and view it. - Form the basis for future contributions of more doxygen content. 1. Devise a way to use Doxygen with Mesos source code. (For example, solve this by adding optional brew/apt-get installation to the Getting Started doc.) 2. Create a make target for libprocess documentation that can be manually triggered. 3. Create initial library top level documentation. 4. Enhance one header file with Doxygen. Make sure the generated output has all necessary links to navigate from the lib to the file and back, etc. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
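For step 4, enhancing a header mostly means adding Doxygen comment blocks; a small example of the format (the function itself is illustrative, not a real libprocess API):
{code}
/**
 * Computes the moving average over a fixed-size window.
 *
 * @param window Number of most recent samples to average over.
 *     Must be greater than zero.
 * @param sample The newest sample to incorporate.
 * @return The average of the last `window` samples seen so far.
 *
 * @see https://www.doxygen.nl/manual/commands.html
 */
double movingAverage(int window, double sample);
{code}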
[jira] [Updated] (MESOS-2333) Securing Sandboxes via Filebrowser Access Control
[ https://issues.apache.org/jira/browse/MESOS-2333?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2333: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Securing Sandboxes via Filebrowser Access Control - Key: MESOS-2333 URL: https://issues.apache.org/jira/browse/MESOS-2333 Project: Mesos Issue Type: Improvement Components: security Reporter: Adam B Assignee: Alexander Rojas Labels: authorization, filebrowser, mesosphere, security As it stands now, anybody with access to the master or slave web UI can use the filebrowser to view the contents of any attached/mounted paths on the master or slave. Currently, the attached paths include master and slave logs as well as executor/task sandboxes. While there's a chance that the master and slave logs could contain sensitive information, it's much more likely that sandboxes could contain customer data or other files that should not be globally accessible. Securing the sandboxes is the primary goal of this ticket. There are four filebrowser endpoints: browse, read, download, and debug. Here are some potential solutions. 1) We could easily provide flags that globally enable/disable each endpoint, allowing coarse-grained access control. This might be a reasonable short-term plan. We would also want to update the web UIs to display an Access Denied error, rather than showing links that open up blank pailers. 2) Each master and slave handles its own authn/authz. Slaves will need to have an authenticator, and there must be a way to provide each node with credentials and ACLs, and keep these in sync across the cluster. 3) Filter all slave communications through the master(s), which already has credentials and ACLs. We'll have to restrict access to the filebrowser (and other?) endpoints to the (leading?) master. Then the master can perform the authentication and authorization, only passing the request on to the slave if auth succeeds. 3a) The slave returns the browse/read/download response back through the master. This could be a network bottleneck. 3b) Upon authn/z success, the master redirects the request to the appropriate slave, which will send the response directly back to the requester. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2372) Test suite for verifying compatibility between Mesos components
[ https://issues.apache.org/jira/browse/MESOS-2372?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2372: -- Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 5 - 3/20) Test suite for verifying compatibility between Mesos components --- Key: MESOS-2372 URL: https://issues.apache.org/jira/browse/MESOS-2372 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Niklas Quarfot Nielsen While our current unit/integration test suite catches functional bugs, it doesn't catch compatibility bugs (e.g., MESOS-2371). This is crucial for giving operators the ability to do seamless upgrades on live clusters. We should have a test suite / framework (ideally running on CI, vetting each review on RB) that tests upgrade paths between master, slave, scheduler and executor. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2215) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks.
[ https://issues.apache.org/jira/browse/MESOS-2215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2215: -- Sprint: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) The Docker containerizer attempts to recover any task when checkpointing is enabled, not just docker tasks. --- Key: MESOS-2215 URL: https://issues.apache.org/jira/browse/MESOS-2215 Project: Mesos Issue Type: Bug Components: docker Affects Versions: 0.21.0 Reporter: Steve Niemitz Assignee: Timothy Chen Once the slave restarts and recovers the task, I see this error in the log for all tasks that were recovered every second or so. Note, these were NOT docker tasks: W0113 16:01:00.790323 773142 monitor.cpp:213] Failed to get resource usage for container 7b729b89-dc7e-4d08-af97-8cd1af560a21 for executor thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd of framework 20150109-161713-715350282-5050-290797-: Failed to 'docker inspect mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21': exit status = exited with status 1 stderr = Error: No such image or container: mesos-7b729b89-dc7e-4d08-af97-8cd1af560a21 However the tasks themselves are still healthy and running. The slave was launched with --containerizers=mesos,docker - More info: it looks like the docker containerizer is a little too ambitious about recovering containers, again this was not a docker task: I0113 15:59:59.476145 773142 docker.cpp:814] Recovering container '7b729b89-dc7e-4d08-af97-8cd1af560a21' for executor 'thermos-1421085237813-slipstream-prod-agent-3-8f769514-1835-4151-90d0-3f55dcc940dd' of framework 20150109-161713-715350282-5050-290797- Looking into the source, it looks like the problem is that the ComposingContainerizer runs recover in parallel, but neither the docker containerizer nor the mesos containerizer checks whether it should recover the task (i.e., whether it was the one that launched it). Perhaps this needs to be written into the checkpoint somewhere? -- This message was sent by Atlassian JIRA (v6.3.4#6332)
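One shape of the fix hinted at in the last sentence: checkpoint which containerizer launched each container, and have each containerizer recover only its own. A sketch with stand-in types, not the actual Mesos recovery code:
{code}
#include <map>
#include <string>
#include <vector>

// Illustrative only. 'checkpointedContainerizer' maps a container ID to the
// name of the containerizer that launched it (written at launch time). Each
// containerizer then recovers only its own containers instead of attempting
// to recover everything, which is what produces the docker inspect errors.
std::vector<std::string> containersToRecover(
    const std::map<std::string, std::string>& checkpointedContainerizer,
    const std::string& myName)  // e.g. "docker" or "mesos"
{
  std::vector<std::string> mine;
  for (const auto& entry : checkpointedContainerizer) {
    if (entry.second == myName) {
      mine.push_back(entry.first);
    }
  }
  return mine;
}
{code}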
[jira] [Updated] (MESOS-2161) AbstractState JNI check fails for Marathon framework
[ https://issues.apache.org/jira/browse/MESOS-2161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2161: -- Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 5 - 3/20) AbstractState JNI check fails for Marathon framework Key: MESOS-2161 URL: https://issues.apache.org/jira/browse/MESOS-2161 Project: Mesos Issue Type: Bug Affects Versions: 0.21.0 Environment: Mesos 0.21.0 Marathon 0.7.5 Fedora 20 Reporter: Matthew Sanders Assignee: Joris Van Remoortere Attachments: mesos_core_dump_gdb.txt.bz2 I've recently upgraded to mesos 0.21.0 and now it seems that every few minutes or so I see the following error, which kills marathon. Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,064] INFO 10.133.128.26 - - [26/Nov/2014:00:12:41 +] GET /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 (mesosphere.chaos.http.ChaosRequestLog:15) Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,238] INFO 10.133.128.26 - - [26/Nov/2014:00:12:42 +] GET /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 (X11; Linux x86_64; rv:31.0) Gecko/20100101 Firefox/31.0 (mesosphere.chaos.http.ChaosRequestLog:15) Nov 25 18:12:42 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:42,961] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +] GET /v2/apps HTTP/1.1 200 2321 http://marathon:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari/537... Nov 25 18:12:43 gianthornet.trading.imc.intra marathon[5453]: [2014-11-25 18:12:43,032] INFO 10.192.221.95 - - [26/Nov/2014:00:12:42 +] GET /v2/deployments HTTP/1.1 200 2 http://marathon:8080/; Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.65 Safari... 
Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: F1125 18:12:44.146260 5897 check.hpp:79] Check failed: f.isReady() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: *** Check failure stack trace: *** Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b17c google::LogMessage::Fail() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2b0d5 google::LogMessage::SendToLog() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2aab3 google::LogMessage::Flush() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a2da3b google::LogMessageFatal::~LogMessageFatal() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1ea64 _checkReady() Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f8176a1d43b Java_org_apache_mesos_state_AbstractState__1_1names_1get Nov 25 18:12:44 gianthornet.trading.imc.intra marathon[5454]: @ 0x7f81f644ca70 (unknown) Nov 25 18:12:44 gianthornet.trading.imc.intra systemd[1]: marathon.service: main process exited, code=killed, status=6/ABRT Here's the command that mesos-master is being run with /usr/local/sbin/mesos-master --zk=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --port=5050 --log_dir=/var/log/mesos --quorum=1 --work_dir=/var/lib/mesos Here's the command that the slave is running with: /usr/local/sbin/mesos-slave --master=zk://usint-zk-d01-node1chi:2191,usint-zk-d01-node2chi:2192,usint-zk-d01-node3chi:2193/mesos --log_dir=/var/log/mesos --containerizers=docker,mesos --executor_registration_timeout=5mins --attributes=country:us;datacenter:njl3;environment:dev;region:amer;timezone:America/Chicago I realize this could also be filed to marathon, but it sort of looks like a c++ issue to me, which is why I came here to post this. Any help would be greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2337) __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos
[ https://issues.apache.org/jira/browse/MESOS-2337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2337: -- Sprint: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) __init__.py not getting installed in $PREFIX/lib/pythonX.Y/site-packages/mesos -- Key: MESOS-2337 URL: https://issues.apache.org/jira/browse/MESOS-2337 Project: Mesos Issue Type: Bug Components: build, python api Reporter: Kapil Arya Assignee: Kapil Arya Priority: Blocker When doing a make install, the src/python/native/src/mesos/__init__.py file is not getting installed in ${PREFIX}/lib/pythonX.Y/site-packages/mesos/. This makes it impossible to do the following import when PYTHONPATH is set to the site-packages directory. {code} import mesos.interface.mesos_pb2 {code} The directories `${PREFIX}/lib/pythonX.Y/site-packages/mesos/{interface,native}/` do have their corresponding `__init__.py` files. Reproducing the bug:
{code}
../configure --prefix=$HOME/test-install
make install
{code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2016) docker_name_prefix is too generic
[ https://issues.apache.org/jira/browse/MESOS-2016?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2016: -- Sprint: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 2 - 11/14, Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) docker_name_prefix is too generic - Key: MESOS-2016 URL: https://issues.apache.org/jira/browse/MESOS-2016 Project: Mesos Issue Type: Bug Reporter: Jay Buffington Assignee: Timothy Chen From docker.hpp and docker.cpp:
{code}
// Prefix used to name Docker containers in order to distinguish those
// created by Mesos from those created manually.
extern std::string DOCKER_NAME_PREFIX;

// TODO(benh): At some point to run multiple slaves we'll need to make
// the Docker container name creation include the slave ID.
string DOCKER_NAME_PREFIX = "mesos-";
{code}
This name is too generic. A common pattern in docker land is to run everything in a container and use volume mounts to share sockets and do RPC between containers. CoreOS has popularized this technique. Inevitably, what people do is start a container named "mesos-slave"; the docker containerizer recovery code then removes all containers whose names start with "mesos-". And then they ask: "huh, why did my mesos-slave docker container die? I don't see any error messages..." Ideally, we should do what Ben suggested and add the slave ID to the name prefix. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2157) Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints
[ https://issues.apache.org/jira/browse/MESOS-2157?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2157: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Add /master/slaves and /master/frameworks/{framework}/tasks/{task} endpoints Key: MESOS-2157 URL: https://issues.apache.org/jira/browse/MESOS-2157 Project: Mesos Issue Type: Task Components: master Reporter: Niklas Quarfot Nielsen Assignee: Alexander Rojas Priority: Trivial Labels: mesosphere, newbie master/state.json exports the entire state of the cluster and can, for large clusters, become massive (tens of megabytes of JSON). Often, a client only needs information about subsets of the entire state, for example all connected slaves, or information (registration info, tasks, etc.) belonging to a particular framework. We can partition state.json into many smaller endpoints, but for starters, being able to get slave information and task information per framework would be useful. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2205) Add user documentation for reservations
[ https://issues.apache.org/jira/browse/MESOS-2205?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2205: -- Sprint: Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 5 - 3/20) Add user documentation for reservations --- Key: MESOS-2205 URL: https://issues.apache.org/jira/browse/MESOS-2205 Project: Mesos Issue Type: Documentation Components: documentation, framework Reporter: Michael Park Assignee: Michael Park Labels: mesosphere Add a user guide for reservations that describes their basic usage, how ACLs are used to specify who can unreserve whose resources, and a few advanced use cases. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2115) Improve recovering Docker containers when slave is contained
[ https://issues.apache.org/jira/browse/MESOS-2115?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2115: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Improve recovering Docker containers when slave is contained Key: MESOS-2115 URL: https://issues.apache.org/jira/browse/MESOS-2115 Project: Mesos Issue Type: Epic Components: docker Reporter: Timothy Chen Assignee: Timothy Chen Labels: docker Currently, when the docker containerizer is recovering, it checks the checkpointed executor pids to determine which containers are still running, and removes the rest of the containers from docker ps that aren't recognized. This is problematic when the slave itself was running in a docker container: when the slave container dies, all the forked processes are removed as well, so the checkpointed executor pids are no longer valid. We have to assume the docker containers might still be running even though the checkpointed executor pids are not. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-1831) Master should send PingSlaveMessage instead of PING
[ https://issues.apache.org/jira/browse/MESOS-1831?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-1831: -- Sprint: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Master should send PingSlaveMessage instead of PING - Key: MESOS-1831 URL: https://issues.apache.org/jira/browse/MESOS-1831 Project: Mesos Issue Type: Task Reporter: Vinod Kone Assignee: Adam B Labels: mesosphere In 0.21.0 master sends PING message with an embedded PingSlaveMessage for backwards compatibility (https://reviews.apache.org/r/25867/). In 0.22.0, master should send PingSlaveMessage directly instead of PING. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2375) Remove the checkpoint variable entirely from slave/flags.hpp
[ https://issues.apache.org/jira/browse/MESOS-2375?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2375: -- Sprint: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Remove the checkpoint variable entirely from slave/flags.hpp Key: MESOS-2375 URL: https://issues.apache.org/jira/browse/MESOS-2375 Project: Mesos Issue Type: Task Reporter: Joerg Schad Assignee: Joerg Schad Labels: checkpoint, mesosphere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2108) Add configure flag or environment variable to enable SSL/libevent Socket
[ https://issues.apache.org/jira/browse/MESOS-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2108: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Add configure flag or environment variable to enable SSL/libevent Socket Key: MESOS-2108 URL: https://issues.apache.org/jira/browse/MESOS-2108 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2085) Add support encrypted and non-encrypted communication in parallel for cluster upgrade
[ https://issues.apache.org/jira/browse/MESOS-2085?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Niklas Quarfot Nielsen updated MESOS-2085: -- Sprint: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20, Mesosphere Q1 Sprint 6 - 4/3 (was: Mesosphere Q4 Sprint 3 - 12/7, Mesosphere Q1 Sprint 1 - 1/23, Mesosphere Q1 Sprint 2 - 2/6, Mesosphere Q1 Sprint 3 - 2/20, Mesosphere Q1 Sprint 4 - 3/6, Mesosphere Q1 Sprint 5 - 3/20) Add support encrypted and non-encrypted communication in parallel for cluster upgrade - Key: MESOS-2085 URL: https://issues.apache.org/jira/browse/MESOS-2085 Project: Mesos Issue Type: Task Reporter: Niklas Quarfot Nielsen Assignee: Joris Van Remoortere During cluster upgrade from non-encrypted to encrypted communication, we need to support an interim where: 1) A master can have connections to both encrypted and non-encrypted slaves 2) A slave that supports encrypted communication connects to a master that has not yet been upgraded. 3) Frameworks are encrypted but the master has not been upgraded yet. 4) Master has been upgraded but frameworks haven't. 5) A slave process has upgraded but running executor processes haven't. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2524) Mesos-containerizer not linked from main documentation page.
[ https://issues.apache.org/jira/browse/MESOS-2524?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371797#comment-14371797 ] James Peach commented on MESOS-2524: FWIW, I'm happy to make this change in order to learn the contribution process ;) Mesos-containerizer not linked from main documentation page. Key: MESOS-2524 URL: https://issues.apache.org/jira/browse/MESOS-2524 Project: Mesos Issue Type: Documentation Reporter: Joerg Schad Assignee: Joerg Schad Priority: Minor Original Estimate: 0.5h Remaining Estimate: 0.5h Is there any reason that the mesos-containerizer (http://mesos.apache.org/documentation/latest/mesos-containerizer/) documentation is not linked from the main documentation page (http://mesos.apache.org/documentation/latest/)? Both the docker and external containerizer pages are linked from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371800#comment-14371800 ] Jie Yu commented on MESOS-2367: --- [~idownes] Vinod and I chatted about this, and he proposed a solution I think is pretty clean: 1) We modify the launcher recover interface to return a list of container IDs that it believes are orphans. 2) Isolators do not clean up orphans during recovery; they simply recover them. 3) The Mesos containerizer takes the list of orphans from launcher recovery and destroys them explicitly. That matches the logic in the steady state. It's still a non-trivial task because we need to modify the recovery path for each isolator. [~idownes], what do you think? Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
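A rough sketch of step 1 of that proposal; the types and signature are illustrative, not the actual launcher interface: recovery reports the containers it found on the machine that are not in the checkpointed state, and the containerizer then destroys them through the normal path.
{code}
#include <set>
#include <string>

// Illustrative stand-in for ContainerID.
using ContainerID = std::string;

// Sketch: the launcher compares what it discovers on the machine (e.g. the
// freezer cgroups it owns) with the containers the slave checkpointed, and
// reports the difference as orphans instead of silently cleaning them up.
// The containerizer would then call destroy() on each orphan, going through
// the same isolator cleanup path as in the steady state.
std::set<ContainerID> recoverAndReportOrphans(
    const std::set<ContainerID>& checkpointed,
    const std::set<ContainerID>& discovered)
{
  std::set<ContainerID> orphans;
  for (const ContainerID& id : discovered) {
    if (checkpointed.count(id) == 0) {
      orphans.insert(id);
    }
  }
  return orphans;
}
{code}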
[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371855#comment-14371855 ] Ian Downes commented on MESOS-2367: --- Perhaps you're suggesting the containerizer has the logic (in this particular case) of increasing the cpu and then trying to destroy the container? Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371846#comment-14371846 ] Ian Downes edited comment on MESOS-2367 at 3/20/15 6:46 PM: This is similar to what I'm proposing but avoids the real issue of how to handle orphans that cannot be destroyed? i.e., what does the containerizer do with the orphans: (3) says it destroys them but this ultimately calls the same code that's failing to destroy a container now? was (Author: idownes): This is similar to what I'm proposing but skirts the real issue of how to handle orphans that cannot be destroyed? i.e., what does the containerizer do with the orphans: (3) says it destroys them but this ultimately calls the same code that's failing to destroy a container now? Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371846#comment-14371846 ] Ian Downes commented on MESOS-2367: --- This is similar to what I'm proposing but skirts the real issue of how to handle orphans that cannot be destroyed? i.e., what does the containerizer do with the orphans: (3) says it destroys them but this ultimately calls the same code that's failing to destroy a container now? Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371864#comment-14371864 ] Jie Yu commented on MESOS-2367: --- [~idownes] I think having orphans that we cannot cleanup is better than a flapping slave. We can address the issue of how to cleanup those orphans reliably later. Yeah, I am more comfortable letting the containerizer increase the cpu because it has the knowledge about what isolators and launcher are used. Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (MESOS-2526) MESOS_master in mesos-slave-env.sh is not work well by mesos-slave
[ https://issues.apache.org/jira/browse/MESOS-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371693#comment-14371693 ] Littlestar edited comment on MESOS-2526 at 3/20/15 5:32 PM: I changed /usr/sbin/mesos-daemon.sh, now it shows detail message in console. {noformat} #nohup ${exec_prefix}/sbin/${PROGRAM} ${@} /dev/null /dev/null 21 ${exec_prefix}/sbin/${PROGRAM} ${@} Failed to create a containerizer: Could not create MesosContainerizer: Could not create isolator cgroups/cpu: Failed to prepare hierarchy for cpu subsystem: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/cpu': Failed to create directory '/sys/fs/cgroup/cpu': No such file or directory {noformat} was (Author: cnstar9988): I changed /usr/sbin/mesos-daemon.sh, now it shows detail message in console. #nohup ${exec_prefix}/sbin/${PROGRAM} ${@} /dev/null /dev/null 21 ${exec_prefix}/sbin/${PROGRAM} ${@} Failed to create a containerizer: Could not create MesosContainerizer: Could not create isolator cgroups/cpu: Failed to prepare hierarchy for cpu subsystem: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/cpu': Failed to create directory '/sys/fs/cgroup/cpu': No such file or directory MESOS_master in mesos-slave-env.sh is not work well by mesos-slave --- Key: MESOS-2526 URL: https://issues.apache.org/jira/browse/MESOS-2526 Project: Mesos Issue Type: Bug Components: framework Affects Versions: 0.21.1 Reporter: Littlestar Priority: Minor mesos start-cluster.sh master node start ok, but slave node start fail. echo slave node has log message, but nothing seens fail. no mesos process running on slave node. == masters and slaves file is ok. the following work ok. export MESOS_master=mymaster:5050 mesos-slave why does it work fail in mesos-slave-env.sh? Thanks -bash-4.1# cat /usr/etc/mesos/mesos-slave-env.sh {noformat} # This file contains environment variables that are passed to mesos-slave. # To get a description of all options run mesos-slave --help; any option # supported as a command-line option is also supported as an environment # variable. # You must at least set MESOS_master. # The mesos master URL to contact. Should be host:port for # non-ZooKeeper based masters, otherwise a zk:// or file:// URL. export MESOS_master=mymaster:5050 # Other options you're likely to want to set: export MESOS_log_dir=/var/log/mesos export MESOS_work_dir=/var/run/mesos export MESOS_isolation=cgroups {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371702#comment-14371702 ] Jie Yu commented on MESOS-2367: --- I agree that we should not change the contract between launcher and isolators. In other words, isolators should not perform cleanup if there are still processes running. Otherwise, it'll create more complex issues that will be very hard to triage. For instance, if the network isolator cleans up the virtual links before processes running inside the container are killed, we could run into issues later when a new task tries to connect to the slave using the same local port (we've seen that before and it's painful to debug). [~idownes], I don't quite understand your proposal. Correct me if I were wrong: {quote} Make orphan clean up failures non-fatal so the slave will start and we gain control over running tasks. {quote} Are you saying that we don't fail the slave when launcher fails to cleanup orphans? Well, that means we still need to check the impl. (i.e., the recovery path) of each isolator to make sure they don't cleanup orphans if the launcher recovery does not report a full success. The isolators need to maintain data structure for those orphans so that the resources used by those orphans are not allocated to new containers. For instance, the network isolator should not allocate the ephemeral port ranges used by those orphans to new containers. Well, I think that's doable and we just need to be careful. Once we've done the above, I don't think we need 3 anymore, right? Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2527) Add default bind to socket
[ https://issues.apache.org/jira/browse/MESOS-2527?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371711#comment-14371711 ] Joris Van Remoortere commented on MESOS-2527: - https://reviews.apache.org/r/28485/ Add default bind to socket -- Key: MESOS-2527 URL: https://issues.apache.org/jira/browse/MESOS-2527 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Priority: Minor Labels: libprocess Fix For: 0.23.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
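The ticket has no description, but judging from the summary, a "default bind" presumably means binding a socket to the any address with an OS-chosen ephemeral port when the caller supplies no explicit address. A generic sketch of such a default bind (plain BSD sockets, not the libprocess implementation under review) is shown below.
{code}
// Generic sketch of a "default bind": bind to INADDR_ANY with port 0 so the
// kernel picks an ephemeral port. Not the libprocess implementation.
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <cstdio>
#include <cstring>
#include <iostream>

int main() {
  int fd = ::socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { std::perror("socket"); return 1; }

  sockaddr_in addr;
  std::memset(&addr, 0, sizeof(addr));
  addr.sin_family = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_ANY);  // 0.0.0.0
  addr.sin_port = htons(0);                  // let the kernel choose a port

  if (::bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    std::perror("bind");
    ::close(fd);
    return 1;
  }

  sockaddr_in bound;
  socklen_t len = sizeof(bound);
  ::getsockname(fd, reinterpret_cast<sockaddr*>(&bound), &len);
  std::cout << "bound to port " << ntohs(bound.sin_port) << std::endl;

  ::close(fd);
  return 0;
}
{code}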
[jira] [Commented] (MESOS-2526) MESOS_master in mesos-slave-env.sh does not work with mesos-slave
[ https://issues.apache.org/jira/browse/MESOS-2526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371693#comment-14371693 ] Littlestar commented on MESOS-2526: --- I changed /usr/sbin/mesos-daemon.sh; now it shows the detailed error message in the console. #nohup ${exec_prefix}/sbin/${PROGRAM} ${@} </dev/null >/dev/null 2>&1 & ${exec_prefix}/sbin/${PROGRAM} ${@} Failed to create a containerizer: Could not create MesosContainerizer: Could not create isolator cgroups/cpu: Failed to prepare hierarchy for cpu subsystem: Failed to mount cgroups hierarchy at '/sys/fs/cgroup/cpu': Failed to create directory '/sys/fs/cgroup/cpu': No such file or directory MESOS_master in mesos-slave-env.sh does not work with mesos-slave --- Key: MESOS-2526 URL: https://issues.apache.org/jira/browse/MESOS-2526 Project: Mesos Issue Type: Bug Components: framework Affects Versions: 0.21.1 Reporter: Littlestar Priority: Minor Using mesos start-cluster.sh, the master node starts ok, but the slave node fails to start. The slave node has log messages, but nothing appears to have failed; no mesos process is running on the slave node. == The masters and slaves files are ok. The following works ok: export MESOS_master=mymaster:5050 mesos-slave. Why does it fail when set in mesos-slave-env.sh? Thanks -bash-4.1# cat /usr/etc/mesos/mesos-slave-env.sh {noformat} # This file contains environment variables that are passed to mesos-slave. # To get a description of all options run mesos-slave --help; any option # supported as a command-line option is also supported as an environment # variable. # You must at least set MESOS_master. # The mesos master URL to contact. Should be host:port for # non-ZooKeeper based masters, otherwise a zk:// or file:// URL. export MESOS_master=mymaster:5050 # Other options you're likely to want to set: export MESOS_log_dir=/var/log/mesos export MESOS_work_dir=/var/run/mesos export MESOS_isolation=cgroups {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371768#comment-14371768 ] Ian Downes commented on MESOS-2367: --- The difficulty here is that the containerizer builds a list of executors that it believes should be running and then the launcher and each of the isolators independently clean up any other state they discover. Success is binary: they either recover everything and clean up everything, or they fail. Instead, I think the launcher should provide more information about what it was able to do: a) attempt recovery of containers assumed to be running and report those it could not recover b) attempt cleanup of other containers and report those it could not clean up Isolators should do a similar thing, as you've described. The critical part we're missing is that if cleanup fails (including if the launcher could not clean up the container) the isolator should account for the resources of that container, e.g., for things like ephemeral port ranges. After recovery the containerizer should know about running containers recovered correctly and also containers that have not been fully destroyed (the union of those from launcher + isolators). This would need to be handled by the slave. The key point is that an incomplete cleanup of a container is not inherently bad, so long as we know about it and we can account for the resources it holds. Furthermore, we must make this visible with appropriate counters. Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
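Read this way, recovery would return a structured report instead of a single pass/fail. A hypothetical shape for such a report is sketched below; the struct and field names are illustrative, not the actual Launcher or Isolator API.
{code}
// Hypothetical sketch of a richer recovery result; not the actual
// Mesos Launcher/Isolator interface.
#include <set>
#include <string>
#include <vector>

struct RecoveryReport {
  // Containers assumed to be running that were recovered successfully.
  std::set<std::string> recovered;

  // Containers assumed to be running that could not be recovered.
  std::set<std::string> unrecovered;

  // Other (orphaned) containers discovered on disk that were cleaned up.
  std::set<std::string> cleanedOrphans;

  // Orphans that could not be cleaned up; the containerizer/slave must keep
  // accounting for their resources and expose them via counters.
  std::set<std::string> leakedOrphans;
};

// Union the leaked orphans reported by the launcher and all isolators.
std::set<std::string> allLeakedOrphans(const std::vector<RecoveryReport>& reports) {
  std::set<std::string> result;
  for (const RecoveryReport& r : reports) {
    result.insert(r.leakedOrphans.begin(), r.leakedOrphans.end());
  }
  return result;
}
{code}
The containerizer would then take the union of the leaked-orphan sets from the launcher and every isolator, keep those containers' resources out of the allocatable pool, and surface the counts as metrics for operators.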
[jira] [Created] (MESOS-2526) MESOS_master in mesos-slave-env.sh does not work with mesos-slave
Littlestar created MESOS-2526: - Summary: MESOS_master in mesos-slave-env.sh does not work with mesos-slave Key: MESOS-2526 URL: https://issues.apache.org/jira/browse/MESOS-2526 Project: Mesos Issue Type: Bug Components: framework Affects Versions: 0.21.1 Reporter: Littlestar Priority: Minor Using mesos start-cluster.sh, the master node starts ok, but the slave node fails to start. The slave node has log messages, but nothing appears to have failed; no mesos process is running on the slave node. == The following works ok: export MESOS_master=mymaster:5050 mesos-slave. Why does it fail when set in mesos-slave-env.sh? Thanks -bash-4.1# cat /usr/etc/mesos/mesos-slave-env.sh {noformat} # This file contains environment variables that are passed to mesos-slave. # To get a description of all options run mesos-slave --help; any option # supported as a command-line option is also supported as an environment # variable. # You must at least set MESOS_master. # The mesos master URL to contact. Should be host:port for # non-ZooKeeper based masters, otherwise a zk:// or file:// URL. export MESOS_master=mymaster:5050 # Other options you're likely to want to set: export MESOS_log_dir=/var/log/mesos export MESOS_work_dir=/var/run/mesos export MESOS_isolation=cgroups {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (MESOS-2527) Add default bind to socket
Joris Van Remoortere created MESOS-2527: --- Summary: Add default bind to socket Key: MESOS-2527 URL: https://issues.apache.org/jira/browse/MESOS-2527 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Joris Van Remoortere Assignee: Joris Van Remoortere Priority: Minor Fix For: 0.23.0 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-2523) Executor directory has incorrect permissions
[ https://issues.apache.org/jira/browse/MESOS-2523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371556#comment-14371556 ] Michael Ngo commented on MESOS-2523: [~vinodkone] I narrowed it down by doing a test before and after this commit. The regression was introduced after this commit. Executor directory has incorrect permissions Key: MESOS-2523 URL: https://issues.apache.org/jira/browse/MESOS-2523 Project: Mesos Issue Type: Bug Reporter: Michael Ngo Attachments: CorrectPermissions.png, WrongPermissions.png Currently my setup involves a mesos master on one node (nodeMM) and a mesos slave on another node (nodeMS). NodeMM runs the mesos-master process as the flxjob user. The framework (Chronos) attached to nodeMM submits tasks as the flxjob user. NodeMS runs the mesos-slave process as root because cgroups are being used. What's expected to happen is that the task will be executed by flxjob and that the directory in which code is executed is also owned by flxjob. What actually happens is that the task is executed by flxjob, but the directory in which code is executed is owned by root. Here are the arguments used by each process. * Master {noformat}/usr/local/sbin/mesos-master --cluster=Mesos HA Cluster --log_dir=/var/log/mesos/master --work_dir=/var/lib/mesos/master --zk=zk://172.16.3.70:2181/mesos --hostname=ip-172-16-15-74 --quorum=1 --zk_session_timeout=10secs --no-root_submissions{noformat} * Slave {noformat}/usr/local/sbin/mesos-slave --log_dir=/var/log/mesos/slave --work_dir=/var/lib/mesos/slave --master=zk://172.16.3.70:2181/mesos --hostname=172.16.3.215 --ip=172.16.3.215 --cgroups_enable_cfs --cgroups_hierarchy=/cgroup --isolation=cgroups/cpu,cgroups/mem --cgroups_limit_swap{noformat} Here is the output of the id command for the user identity. Both the working (expected) and non-working scenarios yield the same output. {noformat} uid=501(flxjob) gid=501(flxjob) groups=501(flxjob),0(root) {noformat} I narrowed down where the issue was introduced. It was introduced by [this commit|https://github.com/apache/mesos/commit/25489e53e9f308c5fca3d0293aeceb716b53149d]. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
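What the reporter expects is that the slave changes ownership of the newly created executor work directory to the task's user before anything runs in it. A hedged sketch of that step follows; the helper name and the example path are hypothetical, and a real fix would have to cover the whole directory tree, not just the top level.
{code}
// Hedged sketch: chown an executor work directory to the task user.
// Hypothetical helper, not the actual Mesos slave/containerizer code.
#include <pwd.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>

bool chownExecutorDirectory(const std::string& directory,
                            const std::string& user) {
  struct passwd* pw = ::getpwnam(user.c_str());
  if (pw == nullptr) {
    std::cerr << "Unknown user '" << user << "'" << std::endl;
    return false;
  }

  // Non-recursive for brevity; the actual fix would need to apply to the
  // entire directory tree created on behalf of the task.
  if (::chown(directory.c_str(), pw->pw_uid, pw->pw_gid) != 0) {
    std::cerr << "chown failed: " << ::strerror(errno) << std::endl;
    return false;
  }
  return true;
}

int main() {
  // Example values mirroring the report: the slave runs as root and the
  // task user is 'flxjob'; the path is a placeholder, not a real layout.
  const std::string directory =
      "/var/lib/mesos/slave/slaves/S0/frameworks/F0/executors/E0/runs/latest";
  return chownExecutorDirectory(directory, "flxjob") ? 0 : 1;
}
{code}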
[jira] [Commented] (MESOS-2171) Compilation error on Ubuntu 12.04
[ https://issues.apache.org/jira/browse/MESOS-2171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371546#comment-14371546 ] Mark Luntzel commented on MESOS-2171: - Sorry, I can't test this anymore, as I'm no longer with the company that uses it. Compilation error on Ubuntu 12.04 -- Key: MESOS-2171 URL: https://issues.apache.org/jira/browse/MESOS-2171 Project: Mesos Issue Type: Bug Components: build Affects Versions: 0.22.0 Environment: Ubuntu 12.04 java version 1.6.0_33 gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) Reporter: Mark Luntzel Priority: Blocker Following http://mesos.apache.org/gettingstarted/ we get a compilation error: ../../../3rdparty/libprocess/src/clock.cpp:167:36: instantiated from here /usr/include/c++/4.6/tr1/functional:2040:46: error: invalid initialization of reference of type 'std::list<process::Timer>&' from expression of type 'std::list<process::Timer>' /usr/include/c++/4.6/tr1/functional:2040:46: error: return-statement with a value, in function returning 'void' [-fpermissive] make[4]: *** [libprocess_la-clock.lo] Error 1 More output here: https://gist.github.com/luntzel/f5a3c62297aae812c986 Please advise. Thanks in advance! -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Re: [jira] [Created] (MESOS-2524) Mesos-containerizer not linked from main documentation page.
No reason. Feel free to fix. On Fri, Mar 20, 2015 at 6:02 AM, Joerg Schad (JIRA) j...@apache.org wrote: Joerg Schad created MESOS-2524: -- Summary: Mesos-containerizer not linked from main documentation page. Key: MESOS-2524 URL: https://issues.apache.org/jira/browse/MESOS-2524 Project: Mesos Issue Type: Documentation Reporter: Joerg Schad Assignee: Joerg Schad Priority: Minor Is there any reason that the mesos-containerizer ( http://mesos.apache.org/documentation/latest/mesos-containerizer/) documentation is not linked from the main documentation page ( http://mesos.apache.org/documentation/latest/)? Both the docker and external containerizer pages are linked from there. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Closed] (MESOS-1570) Make check Error when Building Mesos in a Docker container
[ https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Isabel Jimenez closed MESOS-1570. - Resolution: Fixed Make check Error when Building Mesos in a Docker container --- Key: MESOS-1570 URL: https://issues.apache.org/jira/browse/MESOS-1570 Project: Mesos Issue Type: Bug Reporter: Isabel Jimenez Assignee: Isabel Jimenez Priority: Minor Labels: Docker When building Mesos inside a Docker container, it is currently impossible to run the tests even when you run Docker in --privileged mode. There is a test in stout that sets all the namespaces, and libcontainer does not support setting the 'user' namespace (more information [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]). This is the error: {code:title=Make check failed test|borderStyle=solid} [--] 1 test from OsSetnsTest [ RUN ] OsSetnsTest.setns ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: Failure os::setns(::getpid(), ns): Invalid argument [ FAILED ] OsSetnsTest.setns (7 ms) [--] 1 test from OsSetnsTest (7 ms total) [ FAILED ] 1 test, listed below: [ FAILED ] OsSetnsTest.setns 1 FAILED TEST {code} This can be disabled since Mesos does not need to set the 'user' namespace. I don't know if Docker will ever support setting the user namespace, since it's a new kernel feature, so what would be the best approach to this issue? (Disabling setting of the 'user' namespace in stout, disabling just this test, ...) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1570) Make check Error when Building Mesos in a Docker container
[ https://issues.apache.org/jira/browse/MESOS-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14371949#comment-14371949 ] Isabel Jimenez commented on MESOS-1570: --- Namespace functions were moved from stout to /linux. Tests were changed. This is not an issue anymore. https://reviews.apache.org/r/27092 Make check Error when Building Mesos in a Docker container --- Key: MESOS-1570 URL: https://issues.apache.org/jira/browse/MESOS-1570 Project: Mesos Issue Type: Bug Reporter: Isabel Jimenez Assignee: Isabel Jimenez Priority: Minor Labels: Docker When building Mesos inside a Docker container, it is currently impossible to run the tests even when you run Docker in --privileged mode. There is a test in stout that sets all the namespaces, and libcontainer does not support setting the 'user' namespace (more information [here|https://github.com/docker/libcontainer/blob/master/namespaces/nsenter.go#L136]). This is the error: {code:title=Make check failed test|borderStyle=solid} [--] 1 test from OsSetnsTest [ RUN ] OsSetnsTest.setns ../../../../3rdparty/libprocess/3rdparty/stout/tests/os/setns_tests.cpp:43: Failure os::setns(::getpid(), ns): Invalid argument [ FAILED ] OsSetnsTest.setns (7 ms) [--] 1 test from OsSetnsTest (7 ms total) [ FAILED ] 1 test, listed below: [ FAILED ] OsSetnsTest.setns 1 FAILED TEST {code} This can be disabled since Mesos does not need to set the 'user' namespace. I don't know if Docker will ever support setting the user namespace, since it's a new kernel feature, so what would be the best approach to this issue? (Disabling setting of the 'user' namespace in stout, disabling just this test, ...) -- This message was sent by Atlassian JIRA (v6.3.4#6332)
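For background on the failing test: setns(2) attaches the calling thread to an existing namespace through a /proc/[pid]/ns/* file descriptor, and it fails with EINVAL for a namespace type the surrounding runtime (here libcontainer) never set up. A hedged sketch that joins a process's namespaces while simply skipping 'user' is shown below; it is illustrative only (it needs root / CAP_SYS_ADMIN) and is not the stout or linux/ns test code.
{code}
// Hedged sketch: join another process's namespaces via setns(2), skipping
// the 'user' namespace. Illustrative only; requires root.
#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <fcntl.h>
#include <sched.h>
#include <unistd.h>
#include <cerrno>
#include <cstring>
#include <iostream>
#include <string>
#include <vector>

bool joinNamespaces(pid_t pid) {
  // 'user' is intentionally omitted: it is the namespace libcontainer did
  // not support setting, and Mesos does not need it.
  const std::vector<std::string> namespaces = {"ipc", "uts", "net", "pid", "mnt"};

  for (const std::string& ns : namespaces) {
    const std::string path = "/proc/" + std::to_string(pid) + "/ns/" + ns;
    int fd = ::open(path.c_str(), O_RDONLY);
    if (fd < 0) {
      std::cerr << "open(" << path << ") failed: " << ::strerror(errno) << std::endl;
      return false;
    }
    // Passing 0 lets the kernel accept a namespace fd of any type.
    if (::setns(fd, 0) != 0) {
      std::cerr << "setns(" << ns << ") failed: " << ::strerror(errno) << std::endl;
      ::close(fd);
      return false;
    }
    ::close(fd);
  }
  return true;
}

int main() {
  // Joining our own namespaces is effectively a no-op but exercises the path.
  return joinNamespaces(::getpid()) ? 0 : 1;
}
{code}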
[jira] [Updated] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu updated MESOS-2367: -- Sprint: Twitter Mesos Q1 Sprint 5 Story Points: 5 Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Assignee: Jie Yu Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (MESOS-2367) Improve slave resiliency in the face of orphan containers
[ https://issues.apache.org/jira/browse/MESOS-2367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jie Yu reassigned MESOS-2367: - Assignee: Jie Yu Improve slave resiliency in the face of orphan containers -- Key: MESOS-2367 URL: https://issues.apache.org/jira/browse/MESOS-2367 Project: Mesos Issue Type: Bug Components: slave Reporter: Joe Smith Assignee: Jie Yu Priority: Critical Right now there's a case where a misbehaving executor can cause a slave process to flap: {panel:title=Quote From [~jieyu]} {quote} 1) User tries to kill an instance 2) Slave sends {{KillTaskMessage}} to executor 3) Executor sends kill signals to task processes 4) Executor sends {{TASK_KILLED}} to slave 5) Slave updates container cpu limit to be 0.01 cpus 6) A user-process is still processing the kill signal 7) the task process cannot exit since it has too little cpu share and is throttled 8) Executor itself terminates 9) Slave tries to destroy the container, but cannot because the user-process is stuck in the exit path. 10) Slave restarts, and is constantly flapping because it cannot kill orphan containers {quote} {panel} The slave's orphan container handling should be improved to deal with this case despite ill-behaved users (framework writers). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2508) Slave recovering a docker container results in Unknown container error
[ https://issues.apache.org/jira/browse/MESOS-2508?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Geoffroy Jabouley updated MESOS-2508: - Priority: Minor (was: Major) Changed priority to minor. The fix is to start mesos-slave with only the docker containerizer activated. It is fine if only docker tasks are started with Mesos. Slave recovering a docker container results in Unknown container error --- Key: MESOS-2508 URL: https://issues.apache.org/jira/browse/MESOS-2508 Project: Mesos Issue Type: Bug Components: containerization, docker, slave Affects Versions: 0.21.1 Environment: Ubuntu 14.04.2 LTS Docker 1.5.0 (same error with 1.4.1) Mesos 0.21.1 installed from mesosphere ubuntu repo Marathon 0.8.0 installed from mesosphere ubuntu repo Reporter: Geoffroy Jabouley Priority: Minor I'm seeing some error logs occurring during slave recovery of a Mesos task running in a docker container. It does not impede the slave recovery process, as the mesos task is still active and running on the slave after the recovery. But there is something not working properly when the slave is recovering my docker container: the slave detects my container as an Unknown container. Cluster status: - 1 mesos-master, 1 mesos-slave, 1 marathon framework running on the host. - checkpointing is activated on both slave and framework - using the native docker containerizer - 1 mesos task, started using marathon, is running inside a docker container and is monitored by the mesos-slave Action: - restart the mesos-slave process (sudo restart mesos-slave) Expected: - docker container still running - mesos task still running - no error in the mesos slave log regarding recovery process Seen: - docker container still running - mesos task still running - {color:red}Several errors *Unknown container* in the mesos slave log during recovery process{color} --- For what it's worth, here are my investigations: 1) The mesos task starts fine in the docker container *e4b0de57edf3658046405eff2fbe2f91ac451e04360fc437c20fcfe448297330*. Docker container name is set to *mesos-adb71dc4-c07d-42a9-8fed-264c241668ad* by the Mesos docker containerizer, _I guess_... 
{code} I0317 09:56:14.300439 2784 slave.cpp:1083] Got assigned task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860- I0317 09:56:14.380702 2784 slave.cpp:1193] Launching task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for framework 20150311-150951-3982541578-5050-50860- I0317 09:56:14.384466 2784 slave.cpp:3997] Launching executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework 20150311-150951-3982541578-5050-50860- in work directory '/tmp/mesos/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad' I0317 09:56:14.390207 2784 slave.cpp:1316] Queuing task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' for executor test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework '20150311-150951-3982541578-5050-50860- I0317 09:56:14.421787 2782 docker.cpp:927] Starting container 'adb71dc4-c07d-42a9-8fed-264c241668ad' for task 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' (and executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799') of framework '20150311-150951-3982541578-5050-50860-' I0317 09:56:15.784143 2781 docker.cpp:633] Checkpointing pid 27080 to '/tmp/mesos/meta/slaves/20150312-145235-3982541578-5050-1421-S0/frameworks/20150311-150951-3982541578-5050-50860-/executors/test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799/runs/adb71dc4-c07d-42a9-8fed-264c241668ad/pids/forked.pid' I0317 09:56:15.789443 2784 slave.cpp:2840] Monitoring executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework '20150311-150951-3982541578-5050-50860-' in container 'adb71dc4-c07d-42a9-8fed-264c241668ad' I0317 09:56:15.862642 2784 slave.cpp:1860] Got registration for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860- from executor(1)@10.195.96.237:36021 I0317 09:56:15.865319 2784 slave.cpp:1979] Flushing queued task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 for executor 'test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799' of framework 20150311-150951-3982541578-5050-50860- I0317 09:56:15.885414 2787 slave.cpp:2215] Handling status update TASK_RUNNING (UUID: 79f49cec-92c7-4660-b54e-22dd19c1e67c) for task test-app-bveaf.7733257e-cc83-11e4-b930-56847afe9799 of framework
[jira] [Commented] (MESOS-2490) Enable the allocator to distinguish between role and framework reservations
[ https://issues.apache.org/jira/browse/MESOS-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372263#comment-14372263 ] Michael Park commented on MESOS-2490: - https://reviews.apache.org/r/32333/ Enable the allocator to distinguish between role and framework reservations --- Key: MESOS-2490 URL: https://issues.apache.org/jira/browse/MESOS-2490 Project: Mesos Issue Type: Task Components: allocation Reporter: Michael Park Assignee: Michael Park Labels: mesosphere h3. Goal This is the subsequent task after [MESOS-2489|https://issues.apache.org/jira/browse/MESOS-2489] which enables a framework to send back a reservation offer operation to reserve resources for its role. The goal for this ticket is to teach the allocator to distinguish between a role reservation and framework reservation. Note in particular that this means updating the sorter is out of scope of this task. The goal is strictly to teach the allocator how to send offers to a particular framework rather than a role. h3. Expected Outcome * The framework can send back reservation operations to (un)reserve resources for itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2489) Enable a framework to perform reservation operations.
[ https://issues.apache.org/jira/browse/MESOS-2489?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2489: Sprint: Mesosphere Q1 Sprint 6 - 4/3 Enable a framework to perform reservation operations. - Key: MESOS-2489 URL: https://issues.apache.org/jira/browse/MESOS-2489 Project: Mesos Issue Type: Task Components: master Reporter: Michael Park Assignee: Michael Park Labels: mesosphere h3. Goal This is the first step to supporting dynamic reservations. The goal of this task is to enable a framework to reply to a resource offer with *Reserve* and *Unreserve* offer operations as defined by {{Offer::Operation}} in {{mesos.proto}}. h3. Overview It's divided into a few subtasks so that it's clear what the small chunks to be addressed are. In summary, we need to introduce the {{Resource::ReservationInfo}} protobuf message to encapsulate the reservation information, enable the C++ {{Resources}} class to handle it then enable the master to handle reservation operations. h3. Expected Outcome * The framework will be able to send back reservation operations to (un)reserve resources. * The reservations are kept only in the master since we don't send the {{CheckpointResources}} message to checkpoint the reservations on the slave yet. * The reservations are considered to be reserved for the framework's role. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
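To illustrate the intended flow: a framework receiving unreserved resources in an offer would reply with a Reserve operation that re-labels some of those resources with its role (and later an Unreserve operation to give them back). Since the protobuf messages were still being designed at this point, the sketch below models the idea with plain, hypothetical C++ structs rather than the actual mesos.proto definitions or the scheduler API.
{code}
// Hypothetical, simplified model of a Reserve offer operation; field and
// type names are illustrative, not the actual mesos.proto definitions.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Resource {
  std::string name;   // e.g. "cpus", "mem"
  double scalar;      // amount
  std::string role;   // "*" means unreserved
};

struct ReserveOperation {
  std::vector<Resource> resources;  // resources after the reservation is applied
};

// Take unreserved resources from an offer and build a Reserve operation
// that re-labels them with the framework's role.
ReserveOperation buildReserve(const std::vector<Resource>& offered,
                              const std::string& frameworkRole,
                              double cpus, double mem) {
  ReserveOperation op;
  for (const Resource& r : offered) {
    if (r.role != "*") continue;  // only unreserved resources can be reserved
    if (r.name == "cpus" && cpus > 0.0) {
      op.resources.push_back({r.name, std::min(r.scalar, cpus), frameworkRole});
    } else if (r.name == "mem" && mem > 0.0) {
      op.resources.push_back({r.name, std::min(r.scalar, mem), frameworkRole});
    }
  }
  return op;
}

int main() {
  std::vector<Resource> offered = {{"cpus", 4.0, "*"}, {"mem", 4096.0, "*"}};
  ReserveOperation op = buildReserve(offered, "analytics", 1.0, 512.0);
  for (const Resource& r : op.resources) {
    std::cout << "reserve " << r.scalar << " " << r.name
              << " for role '" << r.role << "'" << std::endl;
  }
  return 0;
}
{code}
The subsequent allocator work (MESOS-2490) is then about whether such a reservation is offered back to the whole role or only to the reserving framework.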
[jira] [Comment Edited] (MESOS-2490) Enable the allocator to distinguish between role and framework reservations
[ https://issues.apache.org/jira/browse/MESOS-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14372263#comment-14372263 ] Michael Park edited comment on MESOS-2490 at 3/20/15 10:47 PM: --- [r32333|https://reviews.apache.org/r/32333/] was (Author: mcypark): https://reviews.apache.org/r/32333/ Enable the allocator to distinguish between role and framework reservations --- Key: MESOS-2490 URL: https://issues.apache.org/jira/browse/MESOS-2490 Project: Mesos Issue Type: Task Components: allocation Reporter: Michael Park Assignee: Michael Park Labels: mesosphere h3. Goal This is the subsequent task after [MESOS-2489|https://issues.apache.org/jira/browse/MESOS-2489] which enables a framework to send back a reservation offer operation to reserve resources for its role. The goal for this ticket is to teach the allocator to distinguish between a role reservation and framework reservation. Note in particular that this means updating the sorter is out of scope of this task. The goal is strictly to teach the allocator how to send offers to a particular framework rather than a role. h3. Expected Outcome * The framework can send back reservation operations to (un)reserve resources for itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (MESOS-2490) Enable the allocator to distinguish between role and framework reservations
[ https://issues.apache.org/jira/browse/MESOS-2490?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Park updated MESOS-2490: Sprint: Mesosphere Q1 Sprint 6 - 4/3 Enable the allocator to distinguish between role and framework reservations --- Key: MESOS-2490 URL: https://issues.apache.org/jira/browse/MESOS-2490 Project: Mesos Issue Type: Task Components: allocation Reporter: Michael Park Assignee: Michael Park Labels: mesosphere h3. Goal This is the subsequent task after [MESOS-2489|https://issues.apache.org/jira/browse/MESOS-2489] which enables a framework to send back a reservation offer operation to reserve resources for its role. The goal for this ticket is to teach the allocator to distinguish between a role reservation and framework reservation. Note in particular that this means updating the sorter is out of scope of this task. The goal is strictly to teach the allocator how to send offers to a particular framework rather than a role. h3. Expected Outcome * The framework can send back reservation operations to (un)reserve resources for itself. -- This message was sent by Atlassian JIRA (v6.3.4#6332)