Re: Trying to debug an issue in mesos task tracking

2015-01-21 Thread Sharma Podila
Have you checked the mesos-slave and mesos-master logs for that task id?
There should be log entries there for its state updates, including FINISHED.
There are specific cases where the task status is not reliably delivered to
your scheduler (due to mesos-master restarts, leader election changes,
etc.). Mesos has task reconciliation support: a periodic call from the
scheduler to reconcile tasks can be helpful, and newer enhancements to
reconciliation are on the way. In the meantime, there are other strategies,
such as the one I use: periodic heartbeats from my custom executor to my
scheduler (out of band). Timeouts on task runtimes are similar to
heartbeats, except that you need a priori knowledge of every task's runtime.

Task runtime limits are not supported natively, as far as I know. Your
executor can implement them, and that may be the simplest way to do it. It
could also be a good way to implement the shell's rlimit*, in general.
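For the periodic reconciliation idea, the scheduler-side bookkeeping can be kept separate from the driver. Below is a minimal sketch in plain Python (the thread's scheduler is Python): it only tracks when each task was last heard from and decides which task ids to reconcile; the actual call, e.g. driver.reconcileTasks(...) in the Python bindings, would consume the returned ids. All class and method names here are illustrative, not part of any Mesos API.

```python
import time


class TaskTracker(object):
    """Track last-seen status update times and flag silent tasks."""

    def __init__(self, silence_threshold=300.0):
        self.silence_threshold = silence_threshold
        self.last_update = {}  # task_id -> timestamp of last status update

    def record_update(self, task_id, timestamp=None):
        # Call this from statusUpdate() in the scheduler.
        self.last_update[task_id] = (
            timestamp if timestamp is not None else time.time())

    def tasks_to_reconcile(self, now=None):
        """Task ids we have not heard about within the threshold."""
        now = now if now is not None else time.time()
        return sorted(
            task_id for task_id, seen in self.last_update.items()
            if now - seen > self.silence_threshold)


# Example: a task silent for longer than the threshold gets flagged.
tracker = TaskTracker(silence_threshold=300.0)
tracker.record_update("task-1", timestamp=1000.0)
tracker.record_update("task-2", timestamp=1500.0)
stale = tracker.tasks_to_reconcile(now=1400.0)  # task-1 silent for 400s
```

A timer in the scheduler would then periodically pass the stale ids to the driver's reconcile call.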



On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com
wrote:

 I'm using a custom internal framework, loosely based on MesosSubmit.
 The phenomenon I'm seeing is something like this:
 1. Task X is assigned to slave S.
 2. I know this task should run for ~10minutes.
 3. On the master dashboard, I see that task X is in the Running state
 for several *hours*.
 4. I SSH into slave S, and see that task X is *not* running. According to
 the local logs on that slave, task X finished a long time ago, and seemed
 to finish OK.
 5. According to the scheduler logs, it never got any update from task X
 after the Staging->Running update.

 The phenomenon occurs pretty often, but it's not consistent or
 deterministic.

 I'd appreciate your input on how to go about debugging it, and/or
 implement a workaround to avoid wasted resources.

 I'm pretty sure the executor on the slave sends the TASK_FINISHED status
 update (how can I verify that beyond my own logging?).
 I'm pretty sure the scheduler never receives that update (again, how can I
 verify that beyond my own logging?).
 I have no idea if the master got the update and passed it through (how can
 I check that?).
 My scheduler and executor are written in Python.

 As for a workaround - setting a timeout on a task should do the trick. I
 did not see any timeout field in the TaskInfo message. Does mesos support
 the concept of per-task timeouts? Or should I implement my own task
 tracking and timeout mechanism in the scheduler?



Re: Accessing stdout/stderr of a task programmatically?

2015-01-21 Thread David Greenberg
Is it possible to know the container_id prior to submitting the TaskInfo?
If not, how can you find it out?

On Wed, Jan 21, 2015 at 1:17 PM, Ian Downes idow...@twitter.com wrote:

 The final component is the container_id. Take a look in
 src/slave/paths.hpp to see the directory layout.

 On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com
 wrote:

 So, I've looked into this more, and the UUID in the runs directory doesn't
 appear to be the task-id, executor-id, or framework-id. Do you have any
 idea what it could be?

 On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com
 wrote:

 Thank you for your answers!

 On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:

 You can get the slave_id, framework_id and executor_id of a task all
 from state.json.

 ie:


{
  "executor_id": "20141231-115728-16777343-5050-49193-S0",
  "framework_id": "20141231-115728-16777343-5050-49193-",
  "id": "1",
  "labels": [],
  "name": "Task 1",
  "resources": {
    "cpus": 6,
    "disk": 0,
    "mem": 13312
  },
  "slave_id": "20141231-115728-16777343-5050-49193-S0",
  "state": "TASK_KILLED",
  "statuses": [
    {
      "state": "TASK_RUNNING",
      "timestamp": 1420056049.88177
    },
    {
      "state": "TASK_KILLED",
      "timestamp": 1420056124.66483
    }
  ]
}


 On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg 
 dsg123456...@gmail.com wrote:

 I was trying to figure out how to programmatically access a task's
 stdout & stderr, and I don't fully understand how the URL is constructed.
 It seems to be of the form
 http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something

 What is the $something? Is there an easier way, given just the
 task_id, to find where the output is?

 Thanks,
 David








Re: Accessing stdout/stderr of a task programmatically?

2015-01-21 Thread Ian Downes
The final component is the container_id. Take a look in src/slave/paths.hpp
to see the directory layout.

On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com
wrote:

 So, I've looked into this more, and the UUID in the runs directory doesn't
 appear to be the task-id, executor-id, or framework-id. Do you have any
 idea what it could be?

 On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com
 wrote:

 Thank you for your answers!

 On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:

 You can get the slave_id, framework_id and executor_id of a task all
 from state.json.

 ie:


{
  "executor_id": "20141231-115728-16777343-5050-49193-S0",
  "framework_id": "20141231-115728-16777343-5050-49193-",
  "id": "1",
  "labels": [],
  "name": "Task 1",
  "resources": {
    "cpus": 6,
    "disk": 0,
    "mem": 13312
  },
  "slave_id": "20141231-115728-16777343-5050-49193-S0",
  "state": "TASK_KILLED",
  "statuses": [
    {
      "state": "TASK_RUNNING",
      "timestamp": 1420056049.88177
    },
    {
      "state": "TASK_KILLED",
      "timestamp": 1420056124.66483
    }
  ]
}


 On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com
  wrote:

 I was trying to figure out how to programmatically access a task's
 stdout & stderr, and I don't fully understand how the URL is constructed.
 It seems to be of the form
 http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something

 What is the $something? Is there an easier way, given just the task_id,
 to find where the output is?

 Thanks,
 David







Re: Storm on Mesos, Anyone Using?

2015-01-21 Thread Brenden Matthews
This is what I sent to Cory, but not to the mailing list:

In my mind, most of the issues were with Storm itself, rather than Mesos.
One annoying thing is that Nimbus is stateful (no HA), so you have to
figure out a way to manage the assets on disk in a safe manner.

We also used reserved resources for Storm (via Mesos roles), because we
were multi-tenant. Without this, it might make for a bad experience (i.e.,
topologies wouldn't be able to launch correctly due to insufficient
resource offers).

On Tue, Jan 20, 2015 at 9:51 PM, Srinivas Murthy srinimur...@gmail.com
wrote:

 Brenden, could you please elaborate a bit on those shortcomings :-)

 On Tue, Jan 20, 2015 at 11:06 AM, Brenden Matthews bren...@diddyinc.com
 wrote:

 Hi Cory,

 We were using the project in production at Airbnb. It may have some
 shortcomings, but it does, in fact, work.


 Hello all!
 I'm interested in Storm on Mesos, but my coworkers don't wanna be guinea
 pigs. Is anyone using mesos/storm https://github.com/mesos/storm in
 production? I see the repo is at least active. :)


 --
 Cory Watson
 Principal Infrastructure Engineer // Keen IO





Re: Accessing stdout/stderr of a task programmatically?

2015-01-21 Thread David Greenberg
So, I've looked into this more, and the UUID in the runs directory doesn't
appear to be the task-id, executor-id, or framework-id. Do you have any
idea what it could be?

On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com
wrote:

 Thank you for your answers!

 On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:

 You can get the slave_id, framework_id and executor_id of a task all from
 state.json.

 ie:


{
  "executor_id": "20141231-115728-16777343-5050-49193-S0",
  "framework_id": "20141231-115728-16777343-5050-49193-",
  "id": "1",
  "labels": [],
  "name": "Task 1",
  "resources": {
    "cpus": 6,
    "disk": 0,
    "mem": 13312
  },
  "slave_id": "20141231-115728-16777343-5050-49193-S0",
  "state": "TASK_KILLED",
  "statuses": [
    {
      "state": "TASK_RUNNING",
      "timestamp": 1420056049.88177
    },
    {
      "state": "TASK_KILLED",
      "timestamp": 1420056124.66483
    }
  ]
}


 On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg dsg123456...@gmail.com
 wrote:

 I was trying to figure out how to programmatically access a task's
 stdout & stderr, and I don't fully understand how the URL is constructed.
 It seems to be of the form
 http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something

 What is the $something? Is there an easier way, given just the task_id,
 to find where the output is?

 Thanks,
 David






Re: Unable to follow Sandbox links from Mesos UI.

2015-01-21 Thread Ryan Thomas
Hey Dan,

The UI will attempt to pull that info directly from the slave, so you need
to make sure the host is resolvable and routable from your browser.

Cheers,

Ryan

From my phone

On Wednesday, 21 January 2015, Dan Dong dongda...@gmail.com wrote:

 Hi, All,
  When I try to access a sandbox on the mesos UI, I see the following info
  (the same error appears for every slave sandbox):

  Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0'
  on 'centos-2.local:5051'.

  Potential reasons:
  - The slave's hostname, 'centos-2.local', is not accessible from your network
  - The slave's port, '5051', is not accessible from your network

  I checked that:
  - slave centos-2.local can be logged into from any machine in the cluster
    without a password, by ssh centos-2.local;
  - port 5051 on slave centos-2.local can be reached from the master, by
    telnet centos-2.local 5051.

  The stdout and stderr files are there under each slave's /tmp/mesos/...,
  but it seems the mesos UI just cannot access them. (Both master and slaves
  are in the same network IP range.) Should I open any port on the slaves?
  Any hint as to what's the problem here?

  Cheers,
  Dan




Re: Accessing stdout/stderr of a task programmatically?

2015-01-21 Thread Ian Downes
No, the container id is generated by the slave when it launches the
executor for a task (see Framework::launchExecutor() in
src/slave/slave.cpp).

However, the 'latest' symlink will point to the most recent container_id
directory so you can likely just use that unless your framework is re-using
executor_ids (which would mean a new container for each run).
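One way around guessing the container UUID is to ask the slave itself: in 0.21-era Mesos, the slave's state.json exposes each executor with a "directory" field holding the full sandbox path, which can then be handed to the slave's files read endpoint. A sketch, under those assumptions (endpoint path and field names should be verified against your Mesos version; read_stdout is a hypothetical helper, not a Mesos API):

```python
import json
import urllib.request


def sandbox_directory(state, framework_id, executor_id):
    """Locate an executor's sandbox path in a slave's state.json dict."""
    for framework in state.get("frameworks", []):
        if framework.get("id") != framework_id:
            continue
        for executor in framework.get("executors", []):
            if executor.get("id") == executor_id:
                # Full path, including the runs/<container_id> component,
                # so no UUID guessing is needed.
                return executor.get("directory")
    return None


def read_stdout(slave_url, directory, offset=0, length=50000):
    """Fetch a chunk of the sandbox's stdout via the slave's files endpoint."""
    url = "%s/files/read.json?path=%s/stdout&offset=%d&length=%d" % (
        slave_url, directory, offset, length)
    with urllib.request.urlopen(url) as response:
        return json.load(response)["data"]
```

In use, you would first fetch `http://<slave>:5051/state.json`, feed the parsed dict to `sandbox_directory`, and pass the result to `read_stdout`.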

On Wed, Jan 21, 2015 at 11:52 AM, David Greenberg dsg123456...@gmail.com
wrote:

 Is it possible to know the container_id prior to submitting the TaskInfo?
 If not, how can you find it out?

 On Wed, Jan 21, 2015 at 1:17 PM, Ian Downes idow...@twitter.com wrote:

 The final component is the container_id. Take a look in
 src/slave/paths.hpp to see the directory layout.

 On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com
 wrote:

 So, I've looked into this more, and the UUID in the runs directory doesn't
 appear to be the task-id, executor-id, or framework-id. Do you have any
 idea what it could be?

 On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg dsg123456...@gmail.com
  wrote:

 Thank you for your answers!

 On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:

 You can get the slave_id, framework_id and executor_id of a task all
 from state.json.

 ie:


{
  "executor_id": "20141231-115728-16777343-5050-49193-S0",
  "framework_id": "20141231-115728-16777343-5050-49193-",
  "id": "1",
  "labels": [],
  "name": "Task 1",
  "resources": {
    "cpus": 6,
    "disk": 0,
    "mem": 13312
  },
  "slave_id": "20141231-115728-16777343-5050-49193-S0",
  "state": "TASK_KILLED",
  "statuses": [
    {
      "state": "TASK_RUNNING",
      "timestamp": 1420056049.88177
    },
    {
      "state": "TASK_KILLED",
      "timestamp": 1420056124.66483
    }
  ]
}


 On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg 
 dsg123456...@gmail.com wrote:

 I was trying to figure out how to programmatically access a task's
 stdout & stderr, and I don't fully understand how the URL is constructed.
 It seems to be of the form
 http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something

 What is the $something? Is there an easier way, given just the
 task_id, to find where the output is?

 Thanks,
 David









cluster wide init

2015-01-21 Thread CCAAT

Hello all,

I was reading about Marathon: "Marathon scheduler processes were started
outside of Mesos using init, upstart, or a similar tool" [1]

So my related questions are

Does Marathon work with mesos + Openrc as the init system?

Are there any other frameworks that work with Mesos + Openrc?


James



[1] http://mesosphere.github.io/marathon/


Re: Accessing stdout/stderr of a task programmatically?

2015-01-21 Thread David Greenberg
It seems that if I take the URL that the Download button for stderr
points to and curl it, I get the file. But if I change the container_id to
latest instead of the UUID, I get a 404. Is there another way to resolve
what the container_id is, since it seems critical for getting files
programmatically?

On Wed, Jan 21, 2015 at 3:17 PM, Ian Downes idow...@twitter.com wrote:

 No, the container id is generated by the slave when it launches the
 executor for a task (see Framework::launchExecutor() in
 src/slave/slave.cpp).

 However, the 'latest' symlink will point to the most recent container_id
 directory so you can likely just use that unless your framework is re-using
 executor_ids (which would mean a new container for each run).

 On Wed, Jan 21, 2015 at 11:52 AM, David Greenberg dsg123456...@gmail.com
 wrote:

 Is it possible to know the container_id prior to submitting the TaskInfo?
 If not, how can you find it out?

 On Wed, Jan 21, 2015 at 1:17 PM, Ian Downes idow...@twitter.com wrote:

 The final component is the container_id. Take a look in
 src/slave/paths.hpp to see the directory layout.

 On Wed, Jan 21, 2015 at 8:50 AM, David Greenberg dsg123456...@gmail.com
  wrote:

 So, I've looked into this more, and the UUID in the runs directory doesn't
 appear to be the task-id, executor-id, or framework-id. Do you have any
 idea what it could be?

 On Tue, Jan 13, 2015 at 5:21 PM, David Greenberg 
 dsg123456...@gmail.com wrote:

 Thank you for your answers!

 On Tue, Jan 13, 2015 at 5:15 PM, Tim Chen t...@mesosphere.io wrote:

 You can get the slave_id, framework_id and executor_id of a task all
 from state.json.

 ie:


{
  "executor_id": "20141231-115728-16777343-5050-49193-S0",
  "framework_id": "20141231-115728-16777343-5050-49193-",
  "id": "1",
  "labels": [],
  "name": "Task 1",
  "resources": {
    "cpus": 6,
    "disk": 0,
    "mem": 13312
  },
  "slave_id": "20141231-115728-16777343-5050-49193-S0",
  "state": "TASK_KILLED",
  "statuses": [
    {
      "state": "TASK_RUNNING",
      "timestamp": 1420056049.88177
    },
    {
      "state": "TASK_KILLED",
      "timestamp": 1420056124.66483
    }
  ]
}


 On Tue, Jan 13, 2015 at 1:48 PM, David Greenberg 
 dsg123456...@gmail.com wrote:

 I was trying to figure out how to programmatically access a task's
 stdout & stderr, and I don't fully understand how the URL is constructed.
 It seems to be of the form
 http://$slave_url:5050/read.json?$work_dir/work/slaves/$slave_id/frameworks/$framework_id/executors/$executor_id/runs/$something

 What is the $something? Is there an easier way, given just the
 task_id, to find where the output is?

 Thanks,
 David










Unable to follow Sandbox links from Mesos UI.

2015-01-21 Thread Dan Dong
Hi, All,
 When I try to access a sandbox on the mesos UI, I see the following info
 (the same error appears for every slave sandbox):

 Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0'
 on 'centos-2.local:5051'.

 Potential reasons:
 - The slave's hostname, 'centos-2.local', is not accessible from your network
 - The slave's port, '5051', is not accessible from your network

 I checked that:
 - slave centos-2.local can be logged into from any machine in the cluster
   without a password, by ssh centos-2.local;
 - port 5051 on slave centos-2.local can be reached from the master, by
   telnet centos-2.local 5051.

The stdout and stderr files are there under each slave's /tmp/mesos/..., but
it seems the mesos UI just cannot access them. (Both master and slaves are
in the same network IP range.) Should I open any port on the slaves? Any
hint as to what's the problem here?

 Cheers,
 Dan


Re: Unable to follow Sandbox links from Mesos UI.

2015-01-21 Thread Cody Maloney
Also see https://issues.apache.org/jira/browse/MESOS-2129 if you want to
track progress on changing this.

Unfortunately it is on hold for me at the moment to fix.

Cody

On Wed, Jan 21, 2015 at 2:07 PM, Ryan Thomas r.n.tho...@gmail.com wrote:

 Hey Dan,

 The UI will attempt to pull that info directly from the slave, so you need
 to make sure the host is resolvable and routable from your browser.

 Cheers,

 Ryan

 From my phone


 On Wednesday, 21 January 2015, Dan Dong dongda...@gmail.com wrote:

 Hi, All,
  When I try to access a sandbox on the mesos UI, I see the following info
  (the same error appears for every slave sandbox):

  Failed to connect to slave '20150115-144719-3205108908-5050-4552-S0'
  on 'centos-2.local:5051'.

  Potential reasons:
  - The slave's hostname, 'centos-2.local', is not accessible from your network
  - The slave's port, '5051', is not accessible from your network

  I checked that:
  - slave centos-2.local can be logged into from any machine in the cluster
    without a password, by ssh centos-2.local;
  - port 5051 on slave centos-2.local can be reached from the master, by
    telnet centos-2.local 5051.

  The stdout and stderr files are there under each slave's /tmp/mesos/...,
  but it seems the mesos UI just cannot access them. (Both master and slaves
  are in the same network IP range.) Should I open any port on the slaves?
  Any hint as to what's the problem here?

  Cheers,
  Dan




Re: Mesos 0.22.0

2015-01-21 Thread Adam Bordelon
Cosmin: 0.21.1-rc2 is actually the same as 0.21.1. Both are tagged to
commit 2ae1ba91e64f92ec71d327e10e6ba9e8ad5477e8

On Wed, Jan 21, 2015 at 3:52 PM, Cosmin Lehene cleh...@adobe.com wrote:

  Also, the release page on github shows 0.21.1-rc2 as being after the
 0.21.1 release... https://github.com/apache/mesos/releases


  Cosmin


  --
 *From:* Tim Chen t...@mesosphere.io
 *Sent:* Tuesday, January 20, 2015 1:36 PM
 *To:* Dave Lester
 *Cc:* user@mesos.apache.org
 *Subject:* Re: Mesos 0.22.0

  Hi Dave,

  Sorry about the blog post, I lost track of it in the middle of other
 tasks.

  I'm going to update the website and the blog post very soon.

  Tim

 On Tue, Jan 20, 2015 at 12:37 PM, Dave Lester d...@davelester.org wrote:

 Thanks Niklas for kicking off this thread. +1 to you as release manager;
 could you please create a JIRA ticket to track the progress so we can
 subscribe?

 A minor correction to your email: Mesos 0.21.1 was voted on in late
 December (see http://markmail.org/message/e2iam7guxukl3r6c), but the
 website wasn't updated, nor was it blogged about as we normally do. Tim
 (cc'd), do you still plan to make this update? Any way others can help?
 I'd like to see this updated before we cut another release.

 +1 to Chris' suggestion of a page to plan future release managers, this
 would bring some longer-term clarity to who is driving feature releases and
 what they include.

 Dave

 On Tue, Jan 20, 2015, at 12:03 PM, Chris Aniszczyk wrote:

  definite +1, lets keep the release rhythm going!

 maybe some space on the wiki for release planning / release managers
 would be a step forward

  On Tue, Jan 20, 2015 at 1:59 PM, Joe Stein joe.st...@stealth.ly wrote:

  +1

  so excited for the persistence primitives, awesome!

   /***
   Joe Stein
   Founder, Principal Consultant
   Big Data Open Source Security LLC
  http://www.stealth.ly
   Twitter: @allthingshadoop http://www.twitter.com/allthingshadoop
  /

  On Tue, Jan 20, 2015 at 2:55 PM, John Pampuch j...@mesosphere.io
 wrote:

 +1!

 -John


 On Tue, Jan 20, 2015 at 11:52 AM, Niklas Nielsen nik...@mesosphere.io
  wrote:

  Hi all,
  
   We have been releasing major versions of Mesos roughly every second
   month (current average is ~66 days) and we are now 2 months after the
   0.21.0 release, so I would like to propose that we start planning for
   0.22.0. Not only in terms of timing, but also because we have some
   exciting features which are getting ready, including persistence
   primitives, modules and SSL support (I probably forgot a ton - please
   chime in).
  
   Since we are stakeholders in SSL and Modules, I would like to
 volunteer as
   release manager.
   Like in previous releases, I'd be happy to collaborate with co-release
   managers to make 0.22.0 a successful release.
  
   Niklas
  






 --
  Cheers,

 Chris Aniszczyk | Open Source | Twitter, Inc.
  @cra | +1 512 961 6719







Re: Marathon stability and use-case

2015-01-21 Thread Niklas Nielsen
Looping in Connor and Dario.

On 21 January 2015 at 17:21, Benjamin Mahler benjamin.mah...@gmail.com
wrote:

 Hm.. I'm not sure if any of the Marathon developers are on this list.

 They have a mailing list here:
 https://groups.google.com/forum/?hl=en#!forum/marathon-framework

 On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral a.k...@bobek.cz wrote:

 Hi all,

 first of all, thank you for all the hard work on Mesos and related stuff.
 We are running fairly small mesos/marathon cluster (3 masters + 9
 slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/ .
 This means that we are sometimes facing network issues, frequently
 caused by DDoS attacks against other servers in the datacenter.

 We are then facing huge problems with our Marathon installation. Typical
 behavior would be that Marathon abandons tasks: it reports a lower number
 of running tasks (frequently 0) than requested, then tries to scale up,
 which fails because the workers are occupied with the previous jobs, which
 are still correctly reported in Mesos.

 We have not been able to pinpoint anything helpful in the log files of
 Marathon. We have tried running in 1 master as well as 3 masters modes.
 3 node mode seemed actually a bit worse.

 The only working solution so far is to stop everything. Wipe ZK and kill
 all jobs on Mesos and then start all components again.

 So I would like to ask a couple of questions:

   - what is the actual use-case for Marathon?

 Is it expected to have larger number of apps/jobs (right now we have
 something like 50 apps) or rather to have like 5 of them, which are
 Mesos frameworks?

   - Is there a way how to tell Marathon to take ownership of currently
 running jobs?

 Honestly, not really sure how this could work as I possibly don't
 have any state information about them.

   - What should be the command line to get some helpful information for
 you guys to debug the problem next time?

 As you can see, the problem is that problems are quite random. We
 didn't have any problem during December, but already had like 3
 total breakdowns last week.

 Thanks a lot,

 Antonin





Re: cluster wide init

2015-01-21 Thread Shuai Lin
You can always write init wrapper scripts for Marathon. There is an
official Debian package, which you can find in Mesos's apt repo.

On Thu, Jan 22, 2015 at 4:20 AM, CCAAT cc...@tampabay.rr.com wrote:

 Hello all,

 I was reading about Marathon: "Marathon scheduler processes were started
 outside of Mesos using init, upstart, or a similar tool" [1]




 So my related questions are

 Does Marathon work with mesos + Openrc as the init system?

 Are there any other frameworks that work with Mesos + Openrc?


 James



 [1] http://mesosphere.github.io/marathon/



Re: Marathon stability and use-case

2015-01-21 Thread Dario Rexin
Thanks Niklas.

Hi Antonin,

Marathon should be able to handle thousands of tasks, and that is exactly
what it's made for. Unfortunately the latest release (0.7.6) has been very
unstable. We fixed a lot of bugs that caused this instability and just
tagged an RC for 0.8.0 yesterday:
https://github.com/mesosphere/marathon/releases/tag/v0.8.0-RC1.

It would be great if you could try this RC and report whether you still see
these issues. I will add the Linux packages and some information about the
changes later today.

Cheers,
Dario

 On 22.01.2015, at 04:35, Niklas Nielsen nik...@mesosphere.io wrote:
 
 Looping in Connor and Dario.
 
 On 21 January 2015 at 17:21, Benjamin Mahler benjamin.mah...@gmail.com 
 wrote:
 Hm.. I'm not sure if any of the Marathon developers are on this list.
 
 They have a mailing list here: 
 https://groups.google.com/forum/?hl=en#!forum/marathon-framework
 
 On Mon, Jan 19, 2015 at 4:07 AM, Antonin Kral a.k...@bobek.cz wrote:
 Hi all,
 
  first of all, thank you for all the hard work on Mesos and related stuff.
 We are running fairly small mesos/marathon cluster (3 masters + 9
 slaves + 3 ZK nodes). All servers are hosted at http://www.hetzner.de/ .
  This means that we are sometimes facing network issues, frequently
  caused by DDoS attacks against other servers in the datacenter.
 
  We are then facing huge problems with our Marathon installation. Typical
  behavior would be that Marathon abandons tasks: it reports a lower number
  of running tasks (frequently 0) than requested, then tries to scale up,
  which fails because the workers are occupied with the previous jobs, which
  are still correctly reported in Mesos.
 
 We have not been able to pinpoint anything helpful in the log files of
 Marathon. We have tried running in 1 master as well as 3 masters modes.
 3 node mode seemed actually a bit worse.
 
 The only working solution so far is to stop everything. Wipe ZK and kill
 all jobs on Mesos and then start all components again.
 
  So I would like to ask a couple of questions:
 
   - what is the actual use-case for Marathon?
 
 Is it expected to have larger number of apps/jobs (right now we have
 something like 50 apps) or rather to have like 5 of them, which are
 Mesos frameworks?
 
   - Is there a way how to tell Marathon to take ownership of currently
 running jobs?
 
 Honestly, not really sure how this could work as I possibly don't
 have any state information about them.
 
    - What should be the command line to get some helpful information for
  you guys to debug the problem next time?
 
 As you can see, the problem is that problems are quite random. We
 didn't have any problem during December, but already had like 3
 total breakdowns last week.
 
 Thanks a lot,
 
 Antonin
 
 


Trying to debug an issue in mesos task tracking

2015-01-21 Thread Itamar Ostricher
I'm using a custom internal framework, loosely based on MesosSubmit.
The phenomenon I'm seeing is something like this:
1. Task X is assigned to slave S.
2. I know this task should run for ~10minutes.
3. On the master dashboard, I see that task X is in the Running state for
several *hours*.
4. I SSH into slave S, and see that task X is *not* running. According to
the local logs on that slave, task X finished a long time ago, and seemed
to finish OK.
5. According to the scheduler logs, it never got any update from task X
after the Staging->Running update.

The phenomenon occurs pretty often, but it's not consistent or
deterministic.

I'd appreciate your input on how to go about debugging it, and/or implement
a workaround to avoid wasted resources.

I'm pretty sure the executor on the slave sends the TASK_FINISHED status
update (how can I verify that beyond my own logging?).
I'm pretty sure the scheduler never receives that update (again, how can I
verify that beyond my own logging?).
I have no idea if the master got the update and passed it through (how can
I check that?).
My scheduler and executor are written in Python.

As for a workaround - setting a timeout on a task should do the trick. I
did not see any timeout field in the TaskInfo message. Does mesos support
the concept of per-task timeouts? Or should I implement my own task
tracking and timeout mechanism in the scheduler?
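Since TaskInfo has no timeout field, the scheduler-side tracking described here can be sketched as a small policy object. This is a sketch, not a Mesos API: the real kill call (driver.killTask(...) in the Python bindings) is injected as a callback so the policy itself stays library-independent, and timestamps are passed in explicitly. All names are illustrative.

```python
class TimeoutPolicy(object):
    """Kill tasks that outlive their expected runtime."""

    def __init__(self, kill_fn):
        self.kill_fn = kill_fn   # e.g. lambda tid: driver.killTask(tid)
        self.deadlines = {}      # task_id -> absolute deadline

    def task_launched(self, task_id, now, max_runtime):
        # Called when the task is launched; max_runtime is the a priori
        # estimate (e.g. ~10 minutes) plus whatever slack you allow.
        self.deadlines[task_id] = now + max_runtime

    def task_finished(self, task_id):
        # Called on a terminal status update; stop tracking the task.
        self.deadlines.pop(task_id, None)

    def enforce(self, now):
        """Kill every task past its deadline; return the ids killed."""
        expired = [tid for tid, deadline in self.deadlines.items()
                   if now > deadline]
        for tid in expired:
            self.kill_fn(tid)
            del self.deadlines[tid]
        return sorted(expired)
```

A periodic timer in the scheduler would call enforce(); combined with reconciliation, this bounds how long a lost TASK_FINISHED can waste resources.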


Re: Architecture question

2015-01-21 Thread Adam Bordelon
You should also look into Chronos for workflow dependency management of
batch jobs (also supports cron-like scheduling).

On Fri, Jan 9, 2015 at 2:12 PM, Srinimurthy srinimur...@gmail.com wrote:

 Tim,
  This is a SAAS environment where the jobs running on each of these nodes
 vary depending on the workflow run by each company; resources (JVMs)
 are allocated per the size and need of the job involved

 Srinivas

 On Jan 9, 2015, at 1:59 PM, Tim Chen t...@mesosphere.io wrote:

 Hi Srinivas,

  Can you elaborate more on what you mean by maintaining a dynamic count of
  executors?

 You can always write a custom framework that provides the scheduling,
  similar to what Marathon or Aurora is doing if it doesn't fit your need.

 Tim

 On Fri, Jan 9, 2015 at 1:18 PM, Srinivas Murthy srinimur...@gmail.com
 wrote:

 Thanks Vinod. I need to deal with a very conservative management that
 needs a lot of selling for each additional open source framework. I have
 glossed over Marathon so far. I was hoping to hear there's some way I could
 override the Scheduler and work with what I have, but I hear you say that
 isn't the route I should be pursuing :-)


 On Fri, Jan 9, 2015 at 11:43 AM, Vinod Kone vinodk...@apache.org wrote:

 Have you looked at Aurora or Marathon? They have some (most?) of the
 features you are looking for.

 On Fri, Jan 9, 2015 at 10:59 AM, Srinivas Murthy srinimur...@gmail.com
 wrote:

 We have a legacy system with home-brewed workflows defined in XPDL,
 running across multiple dozens of nodes. Resources are mapped in XML
 definition files, and availability of a resource to a given task at hand
 is managed by a custom-written job scheduler. Jobs communicate status with
 callback/JMS messages. Job completion decides steps in the workflow.

 To this eco system now comes some Hadoop/Spark jobs.
 I am tentatively exploring Mesos to manage this disparate set of
 clusters.
 How can I maintain a dynamic count of Executors, how can I provide
 dynamic workflow orchestration to pull off the above architecture in the
 Mesos world? Sorry for the noob question!







Re: Architecture question

2015-01-21 Thread Tim St Clair
@some point I'd hope the litany of existing DAG generators that exist for
legacy batch systems would make its way to support this ecosystem.

/me coughs Makeflow, Pegasus ...

| for that matter, one might redo high-throughput systems in a (Docker)
world where NP-hard matching no longer makes any sense, b/c it's all cattle.

Cheers, 
Tim 

- Original Message -

 From: Adam Bordelon a...@mesosphere.io
 To: user@mesos.apache.org
 Sent: Wednesday, January 21, 2015 3:41:40 AM
 Subject: Re: Architecture question

 You should also look into Chronos for workflow dependency management of batch
 jobs (also supports cron-like scheduling).

 On Fri, Jan 9, 2015 at 2:12 PM, Srinimurthy  srinimur...@gmail.com  wrote:

  Tim,

  This is a SAAS environment where the jobs running on each of these nodes
  vary depending on the workflow run by each company; resources (JVMs)
  are allocated per the size and need of the job involved
 

  Srinivas
 

  On Jan 9, 2015, at 1:59 PM, Tim Chen  t...@mesosphere.io  wrote:
 

   Hi Srinivas,
  
 

    Can you elaborate more on what you mean by maintaining a dynamic count
    of executors?
  
 

   You can always write a custom framework that provides the scheduling,
    similar to what Marathon or Aurora is doing if it doesn't fit your need.
  
 

   Tim
  
 

   On Fri, Jan 9, 2015 at 1:18 PM, Srinivas Murthy  srinimur...@gmail.com 
   wrote:
  
 

Thanks Vinod. I need to deal with a very conservative management that
needs a lot of selling for each additional open source framework. I have
glossed over Marathon so far. I was hoping to hear there's some way I could
override the Scheduler and work with what I have, but I hear you say that
isn't the route I should be pursuing :-)
   
  
 

On Fri, Jan 9, 2015 at 11:43 AM, Vinod Kone  vinodk...@apache.org 
wrote:
   
  
 

 Have you looked at Aurora or Marathon? They have some (most?) of the
 features you are looking for.

   
  
 

 On Fri, Jan 9, 2015 at 10:59 AM, Srinivas Murthy srinimur...@gmail.com
 wrote:

   
  
 

  We have a legacy system with home-brewed workflows defined in XPDL,
  running across multiple dozens of nodes. Resources are mapped in XML
  definition files, and availability of a resource to a given task at hand
  is managed by a custom-written job scheduler. Jobs communicate status
  with callback/JMS messages. Job completion decides steps in the workflow.
 

   
  
 

  To this eco system now comes some Hadoop/Spark jobs.

  I am tentatively exploring Mesos to manage this disparate set of
  clusters.

  How can I maintain a dynamic count of Executors, how can I provide
  dynamic workflow orchestration to pull off the above architecture in the
  Mesos world? Sorry for the noob question!
 

   
  
 

-- 
Cheers, 
Timothy St. Clair 
Red Hat Inc.