Re: Trying to debug an issue in mesos task tracking
Have you checked the mesos-slave and mesos-master logs for that task id? There should be entries in there for task state updates, including FINISHED. There are specific cases where the task status is not reliably delivered to your scheduler (mesos-master restarts, leader election changes, etc.). There is task reconciliation support in Mesos: a periodic call from the scheduler to reconcile tasks can be helpful, and there are newer enhancements to task reconciliation on the way. In the meantime, there are other strategies, such as what I use: periodic heartbeats from my custom executor to my scheduler (out of band). Timeouts on task runtimes are similar to heartbeats, except that you need a priori knowledge of every task's runtime. Task runtime limits are not supported inherently, as far as I know. Your executor can implement them, and that may be one simple way to do it. That could also be a good way to implement the shell's rlimit*, in general.

On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com wrote:
I'm using a custom internal framework, loosely based on MesosSubmit. The phenomenon I'm seeing is something like this:
1. Task X is assigned to slave S.
2. I know this task should run for ~10 minutes.
3. On the master dashboard, I see that task X is in the Running state for several *hours*.
4. I SSH into slave S and see that task X is *not* running. According to the local logs on that slave, task X finished a long time ago, and seemed to finish OK.
5. According to the scheduler logs, it never got any update from task X after the Staging->Running update.
The phenomenon occurs pretty often, but it's not consistent or deterministic. I'd appreciate your input on how to go about debugging it, and/or how to implement a workaround to avoid wasted resources. I'm pretty sure the executor on the slave sends the TASK_FINISHED status update (how can I verify that beyond my own logging?). I'm pretty sure the scheduler never receives that update (again, how can I verify that beyond my own logging?). I have no idea if the master got the update and passed it through (how can I check that?). My scheduler and executor are written in Python.
As for a workaround - setting a timeout on a task should do the trick. I did not see any timeout field in the TaskInfo message. Does Mesos support the concept of per-task timeouts? Or should I implement my own task tracking and timeout mechanism in the scheduler?
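For reference, periodic explicit reconciliation from a Python scheduler might look like the sketch below (Mesos 0.21+ bindings; the active_task_ids callback is hypothetical bookkeeping the scheduler has to keep itself):

    import threading

    from mesos.interface import mesos_pb2

    def start_reconciliation(driver, active_task_ids, interval=600.0):
        # Ask the master to re-send the latest status for every task we
        # believe is live; answers arrive via the scheduler's statusUpdate().
        def reconcile():
            statuses = []
            for task_id in active_task_ids():  # hypothetical bookkeeping hook
                status = mesos_pb2.TaskStatus()
                status.task_id.value = task_id
                status.state = mesos_pb2.TASK_RUNNING  # last state we observed
                statuses.append(status)
            driver.reconcileTasks(statuses)
            # Passing an empty list instead requests implicit reconciliation:
            # the master re-sends status for all tasks it knows about.
            timer = threading.Timer(interval, reconcile)
            timer.daemon = True
            timer.start()
        reconcile()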
Re: Accessing stdout/stderr of a task programmatically?
Is it possible to know the container_id prior to submitting the TaskInfo? If not, how can you find it out?

On Wed, Jan 21, 2015 at 1:17 PM, Ian Downes idow...@twitter.com wrote:
The final component is the container_id. Take a look in src/slave/paths.hpp to see the directory layout.

On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com wrote:
So, I've looked into this more, and the UUID in runs doesn't appear to be the task-id, executor-id, or framework-id. Do you have any idea what it could be?

On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com wrote:
Thank you for your answers!

On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:
You can get the slave_id, framework_id and executor_id of a task all from state.json, e.g.:

    {
      "executor_id": "20141231-115728-16777343-5050-49193-S0",
      "framework_id": "20141231-115728-16777343-5050-49193-",
      "id": "1",
      "labels": [],
      "name": "Task 1",
      "resources": { "cpus": 6, "disk": 0, "mem": 13312 },
      "slave_id": "20141231-115728-16777343-5050-49193-S0",
      "state": "TASK_KILLED",
      "statuses": [
        { "state": "TASK_RUNNING", "timestamp": 1420056049.88177 },
        { "state": "TASK_KILLED", "timestamp": 1420056124.66483 }
      ]
    }

On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com wrote:
I was trying to figure out how to programmatically access a task's stdout/stderr, and I don't fully understand how the URL is constructed. It seems to be of the form http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something
What is the $something? Is there an easier way, given just the task_id, to find where the output is? Thanks, David
Re: Accessing stdout/stderr of a task programmatically?
The final component is the container_id. Take a look in src/slave/paths.hpp to see the directory layout.

On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com wrote:
So, I've looked into this more, and the UUID in runs doesn't appear to be the task-id, executor-id, or framework-id. Do you have any idea what it could be?

On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com wrote:
Thank you for your answers!

On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:
You can get the slave_id, framework_id and executor_id of a task all from state.json, e.g.:

    {
      "executor_id": "20141231-115728-16777343-5050-49193-S0",
      "framework_id": "20141231-115728-16777343-5050-49193-",
      "id": "1",
      "labels": [],
      "name": "Task 1",
      "resources": { "cpus": 6, "disk": 0, "mem": 13312 },
      "slave_id": "20141231-115728-16777343-5050-49193-S0",
      "state": "TASK_KILLED",
      "statuses": [
        { "state": "TASK_RUNNING", "timestamp": 1420056049.88177 },
        { "state": "TASK_KILLED", "timestamp": 1420056124.66483 }
      ]
    }

On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com wrote:
I was trying to figure out how to programmatically access a task's stdout/stderr, and I don't fully understand how the URL is constructed. It seems to be of the form http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something
What is the $something? Is there an easier way, given just the task_id, to find where the output is? Thanks, David
Re: Storm on Mesos, Anyone Using?
This is what I sent to Cory, but not to the mailing list: In my mind, most of the issues were with Storm itself, rather than Mesos. One annoying thing is that Nimbus is stateful (no HA), so you have to figure out a way to manage the assets on disk in a safe manner. We also used reserved resources for Storm (via Mesos roles), because we were multi-tenant. Without this, it might make for a bad experience (i.e., topologies wouldn't be able to launch correctly due to insufficient resource offers).

On Tue, Jan 20, 2015 at 9:51 PM, Srinivas Murthy srinimur...@gmail.com wrote:
Brenden, could you please elaborate a bit on those shortcomings :-)

On Tue, Jan 20, 2015 at 11:06 AM, Brenden Matthews bren...@diddyinc.com wrote:
Hi Cory, We were using the project in production at Airbnb. It may have some shortcomings, but it does, in fact, work.

Hello all! I'm interested in Storm on Mesos, but my coworkers don't wanna be guinea pigs. Is anyone using mesos/storm https://github.com/mesos/storm in production? I see the repo is at least active. :)
-- Cory Watson Principal Infrastructure Engineer // Keen IO
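On the reserved-resources point: a sketch of how a framework registers for a role with the 0.21-era Python bindings, plus the flag syntax for static reservations (role name and resource amounts are illustrative; the Storm-on-Mesos framework may wire this up differently):

    from mesos.interface import mesos_pb2

    # Register the framework under the 'storm' role so that offers include
    # resources statically reserved for that role on the slaves.
    framework = mesos_pb2.FrameworkInfo()
    framework.user = ''        # let Mesos fill in the current user
    framework.name = 'storm'
    framework.role = 'storm'   # role name is illustrative
    # ... pass `framework` to MesosSchedulerDriver as usual.

    # On each slave, reserve resources for the role, e.g.:
    #   mesos-slave --resources='cpus(storm):4;mem(storm):8192' ...
    # and start the master with the role whitelisted:
    #   mesos-master --roles=storm ...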
Re: Accessing stdout/stderr of a task programmatically?
So, I've looked into this more, and the UUID in runs doesn't appear to be the task-id, executor-id, or framework-id. Do you have any idea what it could be?

On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com wrote:
Thank you for your answers!

On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:
You can get the slave_id, framework_id and executor_id of a task all from state.json, e.g.:

    {
      "executor_id": "20141231-115728-16777343-5050-49193-S0",
      "framework_id": "20141231-115728-16777343-5050-49193-",
      "id": "1",
      "labels": [],
      "name": "Task 1",
      "resources": { "cpus": 6, "disk": 0, "mem": 13312 },
      "slave_id": "20141231-115728-16777343-5050-49193-S0",
      "state": "TASK_KILLED",
      "statuses": [
        { "state": "TASK_RUNNING", "timestamp": 1420056049.88177 },
        { "state": "TASK_KILLED", "timestamp": 1420056124.66483 }
      ]
    }

On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com wrote:
I was trying to figure out how to programmatically access a task's stdout/stderr, and I don't fully understand how the URL is constructed. It seems to be of the form http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something
What is the $something? Is there an easier way, given just the task_id, to find where the output is? Thanks, David
Re: Unable to follow Sandbox links from Mesos UI.
Hey Dan, The UI will attempt to pull that info directly from the slave, so you need to make sure the host is resolvable and routable from your browser. Cheers, Ryan
From my phone

On Wednesday, 21 January 2015, Dan Dong dongda...@gmail.com wrote:
Hi, All, When I try to access a sandbox in the Mesos UI, I see the following info (the same error appears for every slave's sandbox):
Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0' on 'centos-2.local:5051'. Potential reasons: The slave's hostname, 'centos-2.local', is not accessible from your network; The slave's port, '5051', is not accessible from your network.
I checked that: slave centos-2.local can be logged into from any machine in the cluster without a password via ssh centos-2.local; port 5051 on slave centos-2.local can be reached from the master via telnet centos-2.local 5051. The stdout and stderr files are there under each slave's /tmp/mesos/..., but it seems the Mesos UI just cannot access them. (Both master and slaves are in the same network IP range.) Should I open any port on the slaves? Any hint what the problem is here? Cheers, Dan
Re: Accessing stdout/stderr of a task programmatically?
No, the container id is generated by the slave when it launches the executor for a task (see Framework::launchExecutor() in src/slave/slave.cpp). However, the 'latest' symlink will point to the most recent container_id directory, so you can likely just use that, unless your framework is re-using executor_ids (which would mean a new container for each run).

On Wed, Jan 21, 2015 at 11:52 AM, David Greenberg dsg123456...@gmail.com wrote:
Is it possible to know the container_id prior to submitting the TaskInfo? If not, how can you find it out?

On Wed, Jan 21, 2015 at 1:17 PM, Ian Downes idow...@twitter.com wrote:
The final component is the container_id. Take a look in src/slave/paths.hpp to see the directory layout.

On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com wrote:
So, I've looked into this more, and the UUID in runs doesn't appear to be the task-id, executor-id, or framework-id. Do you have any idea what it could be?

On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com wrote:
Thank you for your answers!

On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:
You can get the slave_id, framework_id and executor_id of a task all from state.json, e.g.:

    {
      "executor_id": "20141231-115728-16777343-5050-49193-S0",
      "framework_id": "20141231-115728-16777343-5050-49193-",
      "id": "1",
      "labels": [],
      "name": "Task 1",
      "resources": { "cpus": 6, "disk": 0, "mem": 13312 },
      "slave_id": "20141231-115728-16777343-5050-49193-S0",
      "state": "TASK_KILLED",
      "statuses": [
        { "state": "TASK_RUNNING", "timestamp": 1420056049.88177 },
        { "state": "TASK_KILLED", "timestamp": 1420056124.66483 }
      ]
    }

On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com wrote:
I was trying to figure out how to programmatically access a task's stdout/stderr, and I don't fully understand how the URL is constructed. It seems to be of the form http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something
What is the $something? Is there an easier way, given just the task_id, to find where the output is? Thanks, David
cluster wide init
Hello all, I was reading about Marathon: "Marathon scheduler processes were started outside of Mesos using init, upstart, or a similar tool" [1]. So my related questions are: Does Marathon work with Mesos + OpenRC as the init system? Are there any other frameworks that work with Mesos + OpenRC? James
[1] http://mesosphere.github.io/marathon/
Re: Accessing stdout/stderr of a task programmatically?
It seems that if I take the URL that the Download button for stderr points to and curl it, I get the file. But if I change the container_id to 'latest' instead of the UUID, I get a 404. Is there another way to resolve what the container_id is? It seems critical for getting files programmatically.

On Wed, Jan 21, 2015 at 3:17 PM, Ian Downes idow...@twitter.com wrote:
No, the container id is generated by the slave when it launches the executor for a task (see Framework::launchExecutor() in src/slave/slave.cpp). However, the 'latest' symlink will point to the most recent container_id directory, so you can likely just use that, unless your framework is re-using executor_ids (which would mean a new container for each run).

On Wed, Jan 21, 2015 at 11:52 AM, David Greenberg dsg123456...@gmail.com wrote:
Is it possible to know the container_id prior to submitting the TaskInfo? If not, how can you find it out?

On Wed, Jan 21, 2015 at 1:17 PM, Ian Downes idow...@twitter.com wrote:
The final component is the container_id. Take a look in src/slave/paths.hpp to see the directory layout.

On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com wrote:
So, I've looked into this more, and the UUID in runs doesn't appear to be the task-id, executor-id, or framework-id. Do you have any idea what it could be?

On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com wrote:
Thank you for your answers!

On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:
You can get the slave_id, framework_id and executor_id of a task all from state.json, e.g.:

    {
      "executor_id": "20141231-115728-16777343-5050-49193-S0",
      "framework_id": "20141231-115728-16777343-5050-49193-",
      "id": "1",
      "labels": [],
      "name": "Task 1",
      "resources": { "cpus": 6, "disk": 0, "mem": 13312 },
      "slave_id": "20141231-115728-16777343-5050-49193-S0",
      "state": "TASK_KILLED",
      "statuses": [
        { "state": "TASK_RUNNING", "timestamp": 1420056049.88177 },
        { "state": "TASK_KILLED", "timestamp": 1420056124.66483 }
      ]
    }

On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com wrote:
I was trying to figure out how to programmatically access a task's stdout/stderr, and I don't fully understand how the URL is constructed. It seems to be of the form http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something
What is the $something? Is there an easier way, given just the task_id, to find where the output is? Thanks, David
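One way to sidestep the container_id question entirely is to ask the slave for the executor's sandbox path: the slave's state.json exposes a 'directory' field per executor that already embeds the current run's container_id. A Python 2 sketch against the 0.21-era HTTP endpoints (field names may differ across versions; error handling omitted):

    import json
    import urllib
    import urllib2

    def sandbox_directory(slave_url, framework_id, executor_id):
        # Resolve an executor's sandbox path from the slave's state.json.
        state = json.load(urllib2.urlopen('%s/state.json' % slave_url))
        frameworks = state.get('frameworks', []) + state.get('completed_frameworks', [])
        for framework in frameworks:
            if framework['id'] != framework_id:
                continue
            executors = framework.get('executors', []) + framework.get('completed_executors', [])
            for executor in executors:
                if executor['id'] == executor_id:
                    return executor['directory']
        return None

    def read_file(slave_url, directory, name='stderr', offset=0, length=50000):
        # read.json returns a JSON object with a slice of the file's contents.
        query = urllib.urlencode({'path': '%s/%s' % (directory, name),
                                  'offset': offset, 'length': length})
        return json.load(urllib2.urlopen('%s/files/read.json?%s' % (slave_url, query)))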
Unable to follow Sandbox links from Mesos UI.
Hi, All, When I try to access a sandbox in the Mesos UI, I see the following info (the same error appears for every slave's sandbox):
Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0' on 'centos-2.local:5051'. Potential reasons: The slave's hostname, 'centos-2.local', is not accessible from your network; The slave's port, '5051', is not accessible from your network.
I checked that: slave centos-2.local can be logged into from any machine in the cluster without a password via ssh centos-2.local; port 5051 on slave centos-2.local can be reached from the master via telnet centos-2.local 5051. The stdout and stderr files are there under each slave's /tmp/mesos/..., but it seems the Mesos UI just cannot access them. (Both master and slaves are in the same network IP range.) Should I open any port on the slaves? Any hint what the problem is here? Cheers, Dan
Re: Unable to follow Sandbox links from Mesos UI.
Also see https://issues.apache.org/jira/browse/MESOS-2129 if you want to track progress on changing this. Unfortunately, fixing it is on hold for me at the moment. Cody

On Wed, Jan 21, 2015 at 2:07 PM, Ryan Thomas r.n.tho...@gmail.com wrote:
Hey Dan, The UI will attempt to pull that info directly from the slave, so you need to make sure the host is resolvable and routable from your browser. Cheers, Ryan
From my phone

On Wednesday, 21 January 2015, Dan Dong dongda...@gmail.com wrote:
Hi, All, When I try to access a sandbox in the Mesos UI, I see the following info (the same error appears for every slave's sandbox):
Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0' on 'centos-2.local:5051'. Potential reasons: The slave's hostname, 'centos-2.local', is not accessible from your network; The slave's port, '5051', is not accessible from your network.
I checked that: slave centos-2.local can be logged into from any machine in the cluster without a password via ssh centos-2.local; port 5051 on slave centos-2.local can be reached from the master via telnet centos-2.local 5051. The stdout and stderr files are there under each slave's /tmp/mesos/..., but it seems the Mesos UI just cannot access them. (Both master and slaves are in the same network IP range.) Should I open any port on the slaves? Any hint what the problem is here? Cheers, Dan
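To test Ryan's two "potential reasons" from the machine where the browser actually runs (rather than from the master), a small Python check like this sketch can help; the hostname and port are the ones from the error message:

    import socket

    def check_slave_reachable(hostname, port=5051, timeout=5.0):
        # Run from the workstation where the browser runs, since the UI's
        # JavaScript fetches sandbox data directly from each slave.
        try:
            addr = socket.gethostbyname(hostname)  # DNS / /etc/hosts resolution
        except socket.gaierror as e:
            return "cannot resolve %s: %s" % (hostname, e)
        try:
            socket.create_connection((addr, port), timeout).close()
        except socket.error as e:
            return "cannot reach %s:%d (%s): %s" % (hostname, port, addr, e)
        return "ok: %s -> %s, port %d reachable" % (hostname, addr, port)

    print(check_slave_reachable('centos-2.local'))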
Re: Mesos 0.22.0
Cosmin: 0.21.1-rc2 is actually the same as 0.21.1. Both are tagged to commit 2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8.

On Wed, Jan 21, 2015 at 3:52 PM, Cosmin Lehene cleh...@adobe.com wrote:
Also, the release page on GitHub shows 0.21.1-rc2 as being after the 0.21.1 release... https://github.com/apache/mesos/releases Cosmin

From: Tim Chen t...@mesosphere.io
Sent: Tuesday, January 20, 2015 1:36 PM
To: Dave Lester
Cc: user@mesos.apache.org
Subject: Re: Mesos 0.22.0
Hi Dave, Sorry about the blog post, I lost track of it in the middle of other tasks. I'm going to update the website and the blog post very soon. Tim

On Tue, Jan 20, 2015 at 12:37 PM, Dave Lester d...@davelester.org wrote:
Thanks Niklas for kicking off this thread. +1 to you as release manager; could you please create a JIRA ticket to track the progress so we can subscribe? A minor correction to your email: Mesos 0.21.1 was voted on in late December (see http://markmail.org/message/e2iam7guxukl3r6c), however the website wasn't updated nor was it blogged about as we normally do. Tim (cc'd), do you still plan to make this update? Any way others can help? I'd like to see this updated before we cut another release. +1 to Chris' suggestion of a page to plan future release managers; this would bring some longer-term clarity to who is driving feature releases and what they include. Dave

On Tue, Jan 20, 2015, at 12:03 PM, Chris Aniszczyk wrote:
Definite +1, let's keep the release rhythm going! Maybe some space on the wiki for release planning / release managers would be a step forward.

On Tue, Jan 20, 2015 at 1:59 PM, Joe Stein joe.st...@stealth.ly wrote:
+1, so excited for the persistence primitives, awesome!
-- Joe Stein, Founder, Principal Consultant, Big Data Open Source Security LLC, http://www.stealth.ly, Twitter: @allthingshadoop http://www.twitter.com/allthingshadoop

On Tue, Jan 20, 2015 at 2:55 PM, John Pampuch j...@mesosphere.io wrote:
+1! -John

On Tue, Jan 20, 2015 at 11:52 AM, Niklas Nielsen nik...@mesosphere.io wrote:
Hi all, We have been releasing major versions of Mesos roughly every second month (the current average is ~66 days) and we are now 2 months after the 0.21.0 release, so I would like to propose that we start planning for 0.22.0. Not only in terms of timing, but also because we have some exciting features which are getting ready, including persistence primitives, modules and SSL support (I probably forgot a ton - please chime in). Since we are stakeholders in SSL and Modules, I would like to volunteer as release manager. Like in previous releases, I'd be happy to collaborate with co-release managers to make 0.22.0 a successful release. Niklas

-- Cheers, Chris Aniszczyk | Open Source | Twitter, Inc. @cra | +1 512 961 6719
Re: Marathon stability and use-case
Looping in Connor and Dario.

On 21 January 2015 at 17:21, Benjamin Mahler benjamin.mah...@gmail.com wrote:
Hm.. I'm not sure if any of the Marathon developers are on this list. They have a mailing list here: https://groups.google.com/forum/?hl=en#!forum/marathon-framework

On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral a.k...@bobek.cz wrote:
Hi all, first of all, thank you for all the hard work on Mesos and related stuff. We are running a fairly small Mesos/Marathon cluster (3 masters + 9 slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/. This means that we sometimes face network issues, frequently caused by a DDoS attack running against other servers in the datacenter. We then face huge problems with our Marathon installation. Typical behavior would be that Marathon abandons tasks: it reports a lower number of running tasks (frequently 0) than requested, so it tries to scale up, which fails because the workers are occupied with the previous jobs, which are still correctly reported in Mesos. We have not been able to pinpoint anything helpful in the Marathon log files. We have tried running with 1 master as well as with 3 masters; the 3-node mode actually seemed a bit worse. The only working solution so far is to stop everything, wipe ZK, kill all jobs on Mesos, and then start all components again.
So I would like to ask a couple of questions:
- What is the actual use-case for Marathon? Is it expected to handle a larger number of apps/jobs (right now we have something like 50 apps), or rather something like 5 of them, which are themselves Mesos frameworks?
- Is there a way to tell Marathon to take ownership of currently running jobs? Honestly, I'm not really sure how this could work, as it possibly doesn't have any state information about them.
- What should be the command line to get some helpful information for you guys to debug the problem next time?
As you can see, the problem is that the failures are quite random. We didn't have any problems during December, but already had something like 3 total breakdowns last week. Thanks a lot, Antonin
Re: cluster wide init
You can always write init wrapper scripts for Marathon. There is an official Debian package, which you can find in Mesos's apt repo.

On Thu, Jan 22, 2015 at 4:20 AM, CCAAT cc...@tampabay.rr.com wrote:
Hello all, I was reading about Marathon: "Marathon scheduler processes were started outside of Mesos using init, upstart, or a similar tool" [1]. So my related questions are: Does Marathon work with Mesos + OpenRC as the init system? Are there any other frameworks that work with Mesos + OpenRC? James
[1] http://mesosphere.github.io/marathon/
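For OpenRC specifically, an init wrapper might look like the sketch below; the paths, flags, and ZooKeeper URLs are illustrative, not from an official package:

    #!/sbin/runscript
    # Hypothetical /etc/init.d/marathon for OpenRC.

    command="/usr/local/bin/marathon"
    command_args="--master zk://localhost:2181/mesos --zk zk://localhost:2181/marathon"
    pidfile="/var/run/marathon.pid"

    depend() {
        need net
    }

    start() {
        ebegin "Starting marathon"
        start-stop-daemon --start --background \
            --make-pidfile --pidfile "${pidfile}" \
            --exec "${command}" -- ${command_args}
        eend $?
    }

    stop() {
        ebegin "Stopping marathon"
        start-stop-daemon --stop --pidfile "${pidfile}"
        eend $?
    }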
Re: Marathon stability and use-case
Thanks Niklas. Hi Antonin, Marathon should be able to handle thousands of tasks, and that is exactly what it's made for. Unfortunately, the latest release (0.7.6) has been very unstable. We fixed a lot of bugs that caused this instability and just tagged an RC for 0.8.0 yesterday: https://github.com/mesosphere/marathon/releases/tag/v0.8.0-RC1. It would be great if you could try this RC and report whether you still see these issues. I will add the Linux packages and some information about the changes later today. Cheers, Dario

On 22.01.2015, at 04:35, Niklas Nielsen nik...@mesosphere.io wrote:
Looping in Connor and Dario.

On 21 January 2015 at 17:21, Benjamin Mahler benjamin.mah...@gmail.com wrote:
Hm.. I'm not sure if any of the Marathon developers are on this list. They have a mailing list here: https://groups.google.com/forum/?hl=en#!forum/marathon-framework

On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral a.k...@bobek.cz wrote:
Hi all, first of all, thank you for all the hard work on Mesos and related stuff. We are running a fairly small Mesos/Marathon cluster (3 masters + 9 slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/. This means that we sometimes face network issues, frequently caused by a DDoS attack running against other servers in the datacenter. We then face huge problems with our Marathon installation. Typical behavior would be that Marathon abandons tasks: it reports a lower number of running tasks (frequently 0) than requested, so it tries to scale up, which fails because the workers are occupied with the previous jobs, which are still correctly reported in Mesos. We have not been able to pinpoint anything helpful in the Marathon log files. We have tried running with 1 master as well as with 3 masters; the 3-node mode actually seemed a bit worse. The only working solution so far is to stop everything, wipe ZK, kill all jobs on Mesos, and then start all components again.
So I would like to ask a couple of questions:
- What is the actual use-case for Marathon? Is it expected to handle a larger number of apps/jobs (right now we have something like 50 apps), or rather something like 5 of them, which are themselves Mesos frameworks?
- Is there a way to tell Marathon to take ownership of currently running jobs? Honestly, I'm not really sure how this could work, as it possibly doesn't have any state information about them.
- What should be the command line to get some helpful information for you guys to debug the problem next time?
As you can see, the problem is that the failures are quite random. We didn't have any problems during December, but already had something like 3 total breakdowns last week. Thanks a lot, Antonin
Trying to debug an issue in mesos task tracking
I'm using a custom internal framework, loosely based on MesosSubmit. The phenomenon I'm seeing is something like this:
1. Task X is assigned to slave S.
2. I know this task should run for ~10 minutes.
3. On the master dashboard, I see that task X is in the Running state for several *hours*.
4. I SSH into slave S and see that task X is *not* running. According to the local logs on that slave, task X finished a long time ago, and seemed to finish OK.
5. According to the scheduler logs, it never got any update from task X after the Staging->Running update.
The phenomenon occurs pretty often, but it's not consistent or deterministic. I'd appreciate your input on how to go about debugging it, and/or how to implement a workaround to avoid wasted resources. I'm pretty sure the executor on the slave sends the TASK_FINISHED status update (how can I verify that beyond my own logging?). I'm pretty sure the scheduler never receives that update (again, how can I verify that beyond my own logging?). I have no idea if the master got the update and passed it through (how can I check that?). My scheduler and executor are written in Python.
As for a workaround - setting a timeout on a task should do the trick. I did not see any timeout field in the TaskInfo message. Does Mesos support the concept of per-task timeouts? Or should I implement my own task tracking and timeout mechanism in the scheduler?
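Since TaskInfo has no timeout field, a scheduler-side watchdog is one workaround. A minimal Python sketch (the class and its bookkeeping are hypothetical; started()/finished() would be driven from the scheduler's statusUpdate() callback):

    import threading
    import time

    from mesos.interface import mesos_pb2

    class TaskDeadlines(object):
        # Mesos itself has no per-task timeout, so track deadlines and
        # kill tasks that overstay them.
        def __init__(self, driver):
            self.driver = driver
            self.lock = threading.Lock()
            self.deadlines = {}  # task_id string -> absolute deadline

        def started(self, task_id, max_runtime_secs):
            # Call from statusUpdate() on TASK_RUNNING.
            with self.lock:
                self.deadlines[task_id] = time.time() + max_runtime_secs

        def finished(self, task_id):
            # Call from statusUpdate() on any terminal state.
            with self.lock:
                self.deadlines.pop(task_id, None)

        def sweep(self):
            # Call periodically, e.g. from a threading.Timer.
            now = time.time()
            with self.lock:
                overdue = [t for t, d in self.deadlines.items() if d < now]
            for task_id in overdue:
                tid = mesos_pb2.TaskID()
                tid.value = task_id
                self.driver.killTask(tid)  # expect TASK_KILLED via statusUpdate()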
Re: Architecture question
You should also look into Chronos for workflow dependency management of batch jobs (it also supports cron-like scheduling).

On Fri, Jan 9, 2015 at 2:12 PM, Srinimurthy srinimur...@gmail.com wrote:
Tim, This is a SaaS environment where the jobs running on each of these nodes vary depending on the workflow run by each company; resources (JVMs) are allocated per the size and needs of the job involved. Srinivas

On Jan 9, 2015, at 1:59 PM, Tim Chen t...@mesosphere.io wrote:
Hi Srinivas, Can you elaborate more on what maintaining a dynamic count of executors means? You can always write a custom framework that provides the scheduling, similar to what Marathon or Aurora is doing, if they don't fit your need. Tim

On Fri, Jan 9, 2015 at 1:18 PM, Srinivas Murthy srinimur...@gmail.com wrote:
Thanks Vinod. I need to deal with a very conservative management that needs a lot of selling for each additional open source framework. I have only glossed over Marathon so far. I was hoping to hear there's some way I could override the Scheduler and work with what I have, but I hear you say that isn't the route I should be pursuing :-)

On Fri, Jan 9, 2015 at 11:43 AM, Vinod Kone vinodk...@apache.org wrote:
Have you looked at Aurora or Marathon? They have some (most?) of the features you are looking for.

On Fri, Jan 9, 2015 at 10:59 AM, Srinivas Murthy srinimur...@gmail.com wrote:
We have a legacy system with home-brewed workflows defined in XPDL, running across multiple dozens of nodes. Resources are mapped in XML definition files, and the availability of resources for a given task is managed by a custom-written job scheduler. Jobs communicate status with callback/JMS messages; job completion decides the steps in the workflow. To this ecosystem now come some Hadoop/Spark jobs. I am tentatively exploring Mesos to manage this disparate set of clusters. How can I maintain a dynamic count of executors, and how can I provide dynamic workflow orchestration to pull off the above architecture in the Mesos world? Sorry for the noob question!
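On the Chronos suggestion: a dependent job posted to its REST API might look like the sketch below (field names from the Chronos docs of that era; the job names, command, and owner are illustrative). A job with "parents" runs whenever all of its parent jobs have completed:

    POST /scheduler/dependency
    {
      "name": "aggregate-results",
      "command": "python aggregate.py",
      "parents": ["extract-data", "transform-data"],
      "epsilon": "PT30M",
      "owner": "ops@example.com"
    }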
Re: Architecture question
At some point I'd hope the litany of existing DAG generators that exist for legacy batch systems would make its way into supporting this ecosystem. /me coughs Makeflow, Pegasus ... For that matter, one might redo high-throughput systems in a (Docker) world where NP-hard matching no longer makes any sense, because it's all cattle. Cheers, Tim

----- Original Message -----
From: Adam Bordelon a...@mesosphere.io
To: user@mesos.apache.org
Sent: Wednesday, January 21, 2015 3:41:40 AM
Subject: Re: Architecture question
You should also look into Chronos for workflow dependency management of batch jobs (it also supports cron-like scheduling).

On Fri, Jan 9, 2015 at 2:12 PM, Srinimurthy srinimur...@gmail.com wrote:
Tim, This is a SaaS environment where the jobs running on each of these nodes vary depending on the workflow run by each company; resources (JVMs) are allocated per the size and needs of the job involved. Srinivas

On Jan 9, 2015, at 1:59 PM, Tim Chen t...@mesosphere.io wrote:
Hi Srinivas, Can you elaborate more on what maintaining a dynamic count of executors means? You can always write a custom framework that provides the scheduling, similar to what Marathon or Aurora is doing, if they don't fit your need. Tim

On Fri, Jan 9, 2015 at 1:18 PM, Srinivas Murthy srinimur...@gmail.com wrote:
Thanks Vinod. I need to deal with a very conservative management that needs a lot of selling for each additional open source framework. I have only glossed over Marathon so far. I was hoping to hear there's some way I could override the Scheduler and work with what I have, but I hear you say that isn't the route I should be pursuing :-)

On Fri, Jan 9, 2015 at 11:43 AM, Vinod Kone vinodk...@apache.org wrote:
Have you looked at Aurora or Marathon? They have some (most?) of the features you are looking for.

On Fri, Jan 9, 2015 at 10:59 AM, Srinivas Murthy srinimur...@gmail.com wrote:
We have a legacy system with home-brewed workflows defined in XPDL, running across multiple dozens of nodes. Resources are mapped in XML definition files, and the availability of resources for a given task is managed by a custom-written job scheduler. Jobs communicate status with callback/JMS messages; job completion decides the steps in the workflow. To this ecosystem now come some Hadoop/Spark jobs. I am tentatively exploring Mesos to manage this disparate set of clusters. How can I maintain a dynamic count of executors, and how can I provide dynamic workflow orchestration to pull off the above architecture in the Mesos world? Sorry for the noob question!

-- Cheers, Timothy St. Clair Red Hat Inc.