[jira] [Commented] (MESOS-1660) Lower ReaperProcess::wait() delay to 500ms or 250ms
[ https://issues.apache.org/jira/browse/MESOS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144532#comment-14144532 ] Till Toenshoff commented on MESOS-1660: --- Hey Guys, this is very promising. Would you need any help to get this committed? Lower ReaperProcess::wait() delay to 500ms or 250ms --- Key: MESOS-1660 URL: https://issues.apache.org/jira/browse/MESOS-1660 Project: Mesos Issue Type: Improvement Components: libprocess Reporter: Craig Hansen-Sturm Assignee: Craig Hansen-Sturm See https://issues.apache.org/jira/browse/MESOS-1199 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (MESOS-1143) Add a TASK_ERROR task status.
[ https://issues.apache.org/jira/browse/MESOS-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144634#comment-14144634 ] Alexander Rukletsov commented on MESOS-1143: Maybe we should agree on a general approach to adding new task states or substates. There are several related issues, like [tasks that are implicitly killed|https://issues.apache.org/jira/browse/MESOS-1736] as a reaction to {{UnregisterFrameworkMessage}}, or [exposing additional info|https://issues.apache.org/jira/browse/MESOS-343] through substates. I would suggest we introduce new task states sparingly and create substates instead. I would opt for a single protobuf enum, something like {{TaskStateExplained}}. There we can have fine-grained states, e.g. whether the task failed because of an OOM event or whether it was killed explicitly or implicitly. Add a TASK_ERROR task status. - Key: MESOS-1143 URL: https://issues.apache.org/jira/browse/MESOS-1143 Project: Mesos Issue Type: Improvement Components: framework, master Reporter: Benjamin Hindman During task validation we drop tasks that have errors and send TASK_LOST status updates. In most circumstances a framework will want to relaunch a task that has gone lost, and in the event the task is actually malformed (thus invalid) this will result in an infinite loop of sending a task and having it go lost.
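The single-enum idea from the comment can be sketched roughly as follows. This is an illustration only, not Mesos code: {{TaskStateExplained}} is the name floated in the comment, but every member name and the should_relaunch helper are hypothetical, chosen to show how fine-grained states could map back to coarse ones and let a framework avoid the relaunch loop described in the issue.

```python
from enum import Enum


class TaskState(Enum):
    # Coarse states; TASK_ERROR is the state proposed in this issue.
    TASK_FAILED = "TASK_FAILED"
    TASK_KILLED = "TASK_KILLED"
    TASK_LOST = "TASK_LOST"
    TASK_ERROR = "TASK_ERROR"


class TaskStateExplained(Enum):
    # Hypothetical fine-grained states, each paired with its coarse state.
    FAILED_OOM = ("failed_oom", TaskState.TASK_FAILED)
    KILLED_EXPLICITLY = ("killed_explicitly", TaskState.TASK_KILLED)
    KILLED_IMPLICITLY = ("killed_implicitly", TaskState.TASK_KILLED)
    LOST_SLAVE_REMOVED = ("lost_slave_removed", TaskState.TASK_LOST)
    ERROR_INVALID_TASK = ("error_invalid_task", TaskState.TASK_ERROR)

    @property
    def coarse(self):
        """Coarse state that this fine-grained state refines."""
        return self.value[1]


def should_relaunch(state):
    """Hypothetical framework policy: retry lost tasks, never malformed ones."""
    return state.coarse is TaskState.TASK_LOST
```

With a distinct error state, a framework that relaunches on TASK_LOST no longer resubmits a malformed task forever.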
[jira] [Commented] (MESOS-1791) Introduce Master / Offer Resource Reservations
[ https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14144879#comment-14144879 ] Timothy St. Clair commented on MESOS-1791: -- Don't reservations indirectly imply preemption? Introduce Master / Offer Resource Reservations -- Key: MESOS-1791 URL: https://issues.apache.org/jira/browse/MESOS-1791 Project: Mesos Issue Type: Epic Reporter: Tom Arnfeld Currently Mesos supports the ability to reserve resources (for a given role) on a per-slave basis, as introduced in MESOS-505. This allows you to almost statically partition off a set of resources on a set of machines, to guarantee certain types of frameworks get some resources. This is very useful, though it is also very useful to be able to control these reservations through the master (instead of per-slave) for when I don't care which nodes I get on, as long as I get X cpu and Y RAM, or Z sets of (X,Y). I'm not sure what structure this could take, but apparently it has already been discussed. Would this be a CLI flag? Could there be an (authenticated) web interface to control these reservations?
[jira] [Updated] (MESOS-1827) Blog post link is incorrect
[ https://issues.apache.org/jira/browse/MESOS-1827?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jon Bringhurst updated MESOS-1827: -- Description: On the page at: https://mesos.apache.org/blog/mesos-0-20-0-released/ The following text: {noformat} On top of network monitoring, evaluating and providing support for per-container network isolation (MESOS-1585) {noformat} Links to MESOS-1407. However, it should probably link to MESOS-1585 instead. was: On the page at: https://mesos.apache.org/blog/mesos-0-20-0-released/ The following text: On top of network monitoring, evaluating and providing support for per-container network isolation (MESOS-1585) Links to MESOS-1407. However, it should probably link to MESOS-1585 instead. Blog post link is incorrect --- Key: MESOS-1827 URL: https://issues.apache.org/jira/browse/MESOS-1827 Project: Mesos Issue Type: Bug Components: project website Reporter: Jon Bringhurst
[jira] [Commented] (MESOS-1240) Update Mesos in Wikipedia
[ https://issues.apache.org/jira/browse/MESOS-1240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145079#comment-14145079 ] Alina Nicolae commented on MESOS-1240: -- I would like to try and solve this issue, but I have one question: what official Mesos docs should appear on the Wikipedia page? Update Mesos in Wikipedia - Key: MESOS-1240 URL: https://issues.apache.org/jira/browse/MESOS-1240 Project: Mesos Issue Type: Documentation Components: documentation Reporter: Adam B Priority: Minor Labels: documentation, wikipedia Right now, Wikipedia's Mesos disambiguation page (https://en.wikipedia.org/wiki/Mesos) links to a section of the GrandLogic/JobServer page: https://en.wikipedia.org/wiki/JobServer#Mesos_Clustering We should grab some of the official Mesos docs and create a legit Mesos Wikipedia page (with relevant links).
[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval
[ https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145123#comment-14145123 ] Ian Downes commented on MESOS-1199: --- Understood. This race has existed in the codebase for a long time. We could consider looking at /proc/{pid}/exe to confirm that the pid at least corresponds to the expected executable - still not perfect though. Subprocess is slow - gated by process::reap poll interval Key: MESOS-1199 URL: https://issues.apache.org/jira/browse/MESOS-1199 Project: Mesos Issue Type: Improvement Affects Versions: 0.18.0 Reporter: Ian Downes Assignee: Craig Hansen-Sturm Attachments: wiatpid.pdf Subprocess uses process::reap to wait on the subprocess pid and set the exit status. However, process::reap polls with a one second interval resulting in a delay up to the interval duration before the status future is set. This means if you need to wait for the subprocess to complete you get hit with E(delay) = 0.5 seconds, independent of the execution time. For example, the MesosContainerizer uses mesos-fetcher in a Subprocess to fetch the executor during launch. At Twitter we fetch a local file, i.e., a very fast operation, but the launch is blocked until the mesos-fetcher pid is reaped - adding 0 to 1 seconds for every launch! The problem is even worse with a chain of short Subprocesses because after the first Subprocess completes you'll be synchronized with the reap interval and you'll see nearly the full interval before notification, i.e., 10 Subprocesses each of 1 second duration will take ~10 seconds! This has become particularly apparent in some new tests I'm working on where test durations are now greatly extended with each taking several seconds.
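The delay arithmetic in the issue description can be checked with a small model. This is a sketch under simplifying assumptions, not libprocess code: poll ticks are assumed to fall on exact multiples of the interval, each subprocess in a chain is assumed to start the instant the previous one is reaped, and all function names are hypothetical. Times are integer milliseconds to keep the arithmetic exact.

```python
def next_poll_ms(t_ms, interval_ms):
    """First poll tick at or after t_ms; ticks fall on multiples of interval_ms."""
    return -(-t_ms // interval_ms) * interval_ms  # integer ceiling division


def mean_delay_ms(interval_ms):
    """Average exit-to-reap delay for exit times spread evenly over one interval."""
    delays = [next_poll_ms(e, interval_ms) - e for e in range(1, interval_ms + 1)]
    return sum(delays) / len(delays)


def chain_total_ms(start_ms, durations_ms, interval_ms):
    """Wall time for a chain where each subprocess starts once the previous is reaped."""
    t = start_ms
    for duration in durations_ms:
        exit_ms = t + duration
        t = next_poll_ms(exit_ms, interval_ms)  # exit is only noticed at the next tick
    return t - start_ms


print(mean_delay_ms(1000))                    # 499.5, i.e. E(delay) is about 0.5 s
print(chain_total_ms(300, [50] * 10, 1000))   # 9700: ten 50 ms subprocesses take ~10 s
print(chain_total_ms(300, [50] * 10, 250))    # 2450 with the 250 ms interval from MESOS-1660
```

After the first reap the chain is synchronized with the ticks, so every later subprocess pays almost the full interval regardless of how short it is.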
[jira] [Commented] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval
[ https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145147#comment-14145147 ] Ian Downes commented on MESOS-1199: --- new review with dynamic poll interval: https://reviews.apache.org/r/25947/ Subprocess is slow - gated by process::reap poll interval Key: MESOS-1199 URL: https://issues.apache.org/jira/browse/MESOS-1199 Project: Mesos Issue Type: Improvement Affects Versions: 0.18.0 Reporter: Ian Downes Assignee: Craig Hansen-Sturm Attachments: wiatpid.pdf
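The review itself is not reproduced in this message, so the following is only a guess at what a dynamic poll interval could look like: poll quickly while few pids are being watched and back off toward the original one-second interval as the count grows. The thresholds, the 100 ms floor, and the function name are all hypothetical and are not taken from the review.

```python
MIN_INTERVAL_MS = 100    # hypothetical floor for the poll interval
MAX_INTERVAL_MS = 1000   # the one-second interval described in the issue
LOW_PID_COUNT = 50       # hypothetical: at or below this, poll at the floor
HIGH_PID_COUNT = 500     # hypothetical: at or above this, poll at the ceiling


def poll_interval_ms(pid_count):
    """Interpolate linearly between the floor and ceiling by watched-pid count."""
    if pid_count <= LOW_PID_COUNT:
        return MIN_INTERVAL_MS
    if pid_count >= HIGH_PID_COUNT:
        return MAX_INTERVAL_MS
    fraction = (pid_count - LOW_PID_COUNT) / (HIGH_PID_COUNT - LOW_PID_COUNT)
    return round(MIN_INTERVAL_MS + fraction * (MAX_INTERVAL_MS - MIN_INTERVAL_MS))
```

The trade-off being sketched is latency against polling cost: a short interval helps the single-fetcher launch case, while a long one bounds the work done when many pids are watched.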
[jira] [Comment Edited] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval
[ https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145123#comment-14145123 ] Ian Downes edited comment on MESOS-1199 at 9/23/14 6:07 PM: Understood. This race has existed in the codebase for a long time. We could consider looking at /proc/\{pid\}/exe to confirm that the pid at least corresponds to the expected executable - still not perfect though. was (Author: idownes): Understood. This race has existed in the codebase for a long time. We could consider looking at /proc/{pid}/exe to confirm that the pid at least corresponds to the expected executable - still not perfect though. Subprocess is slow - gated by process::reap poll interval Key: MESOS-1199 URL: https://issues.apache.org/jira/browse/MESOS-1199 Project: Mesos Issue Type: Improvement Affects Versions: 0.18.0 Reporter: Ian Downes Assignee: Craig Hansen-Sturm Attachments: wiatpid.pdf
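The /proc check suggested in the comment could look roughly like this on Linux. It is a minimal sketch, not the Mesos implementation; the function name is hypothetical, and it returns False whenever the link cannot be read (pid gone, permission denied, or no /proc at all).

```python
import os


def pid_runs_executable(pid, expected_path):
    """Return True if /proc/<pid>/exe points at expected_path (Linux only)."""
    try:
        actual = os.readlink(f"/proc/{pid}/exe")
    except OSError:
        # No such pid, no permission to read the link, or no /proc on this platform.
        return False
    return actual == os.path.realpath(expected_path)
```

As the comment says, this is still not perfect: a recycled pid may belong to another instance of the same executable, and the check itself races with process exit.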
[jira] [Updated] (MESOS-1199) Subprocess is slow - gated by process::reap poll interval
[ https://issues.apache.org/jira/browse/MESOS-1199?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ian Downes updated MESOS-1199: -- Story Points: 1 Subprocess is slow - gated by process::reap poll interval Key: MESOS-1199 URL: https://issues.apache.org/jira/browse/MESOS-1199 Project: Mesos Issue Type: Improvement Affects Versions: 0.18.0 Reporter: Ian Downes Assignee: Ian Downes Attachments: wiatpid.pdf
[jira] [Commented] (MESOS-1107) Create screencast for Getting Started with Mesos
[ https://issues.apache.org/jira/browse/MESOS-1107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145154#comment-14145154 ] Alina Nicolae commented on MESOS-1107: -- Should it also be with audio instructions or just video? Create screencast for Getting Started with Mesos Key: MESOS-1107 URL: https://issues.apache.org/jira/browse/MESOS-1107 Project: Mesos Issue Type: Task Components: documentation Reporter: Dave Lester Priority: Minor It'd be great to have a video screencast that walks people through the steps of building Mesos and running example framework instructions that are included on the Getting Started page http://mesos.apache.org/gettingstarted/
[jira] [Commented] (MESOS-1675) Decouple version of the mesos library from the package release version
[ https://issues.apache.org/jira/browse/MESOS-1675?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145273#comment-14145273 ] Timothy St. Clair commented on MESOS-1675: -- If we remove the non-standard major version numbering that is tacked onto the .so file, then I would say yes. I think I will need to revise the patch to reflect that. Decouple version of the mesos library from the package release version -- Key: MESOS-1675 URL: https://issues.apache.org/jira/browse/MESOS-1675 Project: Mesos Issue Type: Bug Reporter: Vinod Kone This discussion should be rolled into the larger discussion around how to version Mesos (APIs, packages, libraries etc). Some notes from libtool docs. http://www.gnu.org/software/libtool/manual/html_node/Updating-version-info.html http://www.gnu.org/software/libtool/manual/html_node/Release-numbers.html#Release-numbers
[jira] [Commented] (MESOS-1791) Introduce Master / Offer Resource Reservations
[ https://issues.apache.org/jira/browse/MESOS-1791?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14145449#comment-14145449 ] Adam B commented on MESOS-1791: --- Preemption (or at least inverse offers) would certainly help in the case where there are not enough available resources to satisfy a new reservation, but the reservations could also be seen as a desired state to try to reach once tasks complete and enough resources become available to reallocate to the new role/framework. A few definitions/clarifications: Master reservations are a way for the master to provide centralized control over the reservations on every slave. These may be the same static role-based reservations, except they are configurable through the master, perhaps through an (authenticated) web/REST interface. It could be as simple as the same reservation info forwarded to all slaves (reserve 1GB RAM on every slave), or configurable per-slave. Offer reservations are not tied to particular slaves, but allow a framework to request a certain amount of global resources (perhaps split into multiple sets, one per host), regardless of placement. An example would be "make sure my framework/role always has 20 CPUs and 20GB RAM available somewhere in the cluster, but I don't care where." In this scenario, slaves do not need to specifically reserve any of their resources for that framework, so long as the offer reservation can be satisfied by resources from the rest of the cluster. The master will be responsible for enforcing these reservations. The framework/role gets the benefit of defining its desired resources at a higher level of abstraction, without knowledge of how many machines there are or how many resources are available on each.
Introduce Master / Offer Resource Reservations -- Key: MESOS-1791 URL: https://issues.apache.org/jira/browse/MESOS-1791 Project: Mesos Issue Type: Epic Reporter: Tom Arnfeld
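The offer-reservation example from the comment (20 CPUs and 20GB RAM somewhere in the cluster) can be illustrated with a placement-agnostic check. This is a sketch only: the resource names, slave identifiers, and the can_satisfy function are hypothetical, and nothing here reflects how the master would actually enforce a reservation.

```python
def can_satisfy(free_by_slave, wanted):
    """True if free resources summed across all slaves cover `wanted`."""
    totals = {}
    for free in free_by_slave.values():
        for name, amount in free.items():
            totals[name] = totals.get(name, 0) + amount
    return all(totals.get(name, 0) >= amount for name, amount in wanted.items())


# Hypothetical cluster: no single slave has 20 CPUs and 20GB, but together they do.
free = {
    "slave-1": {"cpus": 8, "mem_gb": 16},
    "slave-2": {"cpus": 16, "mem_gb": 8},
}
print(can_satisfy(free, {"cpus": 20, "mem_gb": 20}))  # True
```

The check only looks at cluster-wide totals. Reservations split into per-host sets, as the comment also mentions, would additionally need a placement step that this sketch leaves out.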
[jira] [Created] (MESOS-1828) The Mesos UI should link to framework UIs
Tobi Knaup created MESOS-1828: - Summary: The Mesos UI should link to framework UIs Key: MESOS-1828 URL: https://issues.apache.org/jira/browse/MESOS-1828 Project: Mesos Issue Type: Wish Reporter: Tobi Knaup Most frameworks have a web UI or HTTP API. It would be nice to show a direct link from the Mesos web UI so it's easy to navigate there. Currently a user needs to know where schedulers are running, because Mesos doesn't have that knowledge. A framework should provide a URL to its API/UI when it connects, as part of FrameworkInfo. This could be explicit proto fields (e.g. gui_url, api_url), or more generic key/value pairs.
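The two shapes proposed in the issue (explicit fields versus generic key/value pairs) can be compared with a small sketch. This is not the FrameworkInfo protobuf: the class, the extra_urls field, and the ui_links helper are hypothetical, and only gui_url and api_url come from the issue text.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FrameworkInfoSketch:
    name: str
    gui_url: Optional[str] = None   # explicit field, as suggested in the issue
    api_url: Optional[str] = None   # explicit field, as suggested in the issue
    extra_urls: Dict[str, str] = field(default_factory=dict)  # generic key/value alternative


def ui_links(info):
    """Collect the links a web UI could render for one framework."""
    links = dict(info.extra_urls)
    if info.gui_url:
        links["gui"] = info.gui_url
    if info.api_url:
        links["api"] = info.api_url
    return links


info = FrameworkInfoSketch(name="example", gui_url="http://scheduler.example.com:8080")
print(ui_links(info))  # {'gui': 'http://scheduler.example.com:8080'}
```

Explicit fields give the web UI fixed names to look for, while key/value pairs let a framework advertise endpoints the UI did not anticipate.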