Re: Users, or Job Types Base Resource Allocation

2017-06-13 Thread Sharma Podila
Mesos does have role-based allocation that may cover this need.
Alternatively, schedulers can do this from within the resources allocated
to them; as an example, the OSS Fenzo library helps with the latter. Do you
mean one of these, or something else?


On Tue, Jun 13, 2017 at 1:48 AM, Bryan Fok  wrote:

> Hi all
>
>   Currently, a Mesos instance cannot prioritize resource allocation based
> on users, job types, etc. Is the community interested in providing such
> functionality?
>
> B.R
> Bryan
>


Re: Can I consider other framework tasks as a resource? Does it make sense?

2016-12-15 Thread Sharma Podila
Response below:

On Thu, Dec 15, 2016 at 5:22 AM, Petr Novak <oss.mli...@gmail.com> wrote:

> It is very helpful. I will take a deeper look at Fenzo.
> Isn’t pretty much everything external knowledge to a scheduler? CPU, mem,
> net, storage… all of this information has to somehow get into the scheduler.
> But for these there is internal support in Mesos via resource offers, and it
> is what I think you mean by internal vs. external.
>

Yes, that's right.

>
>
> What I’m thinking is that there is already a mechanism in Mesos for getting
> information into the scheduler, but it is not extendable with custom resource
> types. Thinking about offered resources, I have also realized that there is
> a common trait to them – they are consumable. When one task accepts some
> resources, they are not available to other tasks. Hence, if I would like to
> represent other constraints as resources, they would probably have to have
> this property. Then, in theory, they could be plugged into the Mesos
> resources mechanism. Possibly not all constraints can be modelled as
> consumables, and the approach through a pluggable scheduling library like
> Fenzo might be more flexible.
>

Constraints can be on non-consumables. For example, we have constraints on
custom attributes, not just resources. The trick is to get the information
on other tasks on the agent back into the scheduler. Today we do this only
among tasks of the same framework, so Fenzo knows about all of them. If
there is a way to get the dynamic task scheduling info for the other
frameworks, you could, for example, add those into Fenzo and let it
maintain state and enforce the constraints.
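As a rough standalone sketch of that idea (plain Java, not Fenzo's actual plugin API; all names here are made up for illustration), the external knowledge of other frameworks' tasks could be tracked per agent and used to score agents when evaluating a colocation constraint:

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Standalone sketch (plain Java, NOT Fenzo's actual plugin API) of routing
// external knowledge -- which agents host other frameworks' tasks -- into a
// scheduler and scoring agents so a colocation constraint can be evaluated.
class LocalityScorer {
    // agent hostname -> task IDs (from other frameworks) known to run there
    private final Map<String, Set<String>> externalTasksByAgent = new HashMap<>();

    // Called whenever an external source (e.g. Mesos DNS or a state poller)
    // reports another framework's task running on an agent.
    void recordExternalTask(String agent, String taskId) {
        externalTasksByAgent.computeIfAbsent(agent, a -> new HashSet<>()).add(taskId);
    }

    // Score an agent by how many tasks of interest it hosts; a scheduler
    // would prefer the highest-scoring agent among those with a fitting offer.
    int score(String agent, Set<String> tasksOfInterest) {
        Set<String> running =
            externalTasksByAgent.getOrDefault(agent, Collections.emptySet());
        int hits = 0;
        for (String t : tasksOfInterest) {
            if (running.contains(t)) hits++;
        }
        return hits;
    }
}
```

In a Fenzo-based scheduler, logic like this would live inside a custom constraint or fitness plugin; the sketch only shows the state-keeping and scoring, not the plugin wiring.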



>
>
> My original question was basically about what counts as scheduling, so that
> when I need to model some constraint on how to place a task, I would know
> where it belongs in my framework’s code. It seems to be answered. Thanks a lot.
>
>
>
> *From:* Sharma Podila [mailto:spod...@netflix.com]
> *Sent:* 15. prosince 2016 1:59
> *To:* user@mesos.apache.org
>
> *Subject:* Re: Can I consider other framework tasks as a resource? Does
> it make sense?
>
>
>
> In general, placing a task based on certain constraints (e.g., locality
> with other tasks) is a scheduling concern. The complexity in your scenario
> is that the constraint specification requires knowledge external to your
> scheduler. If you are able to route that external information (on what and
> where other frameworks' tasks are running) into your scheduler, then you
> should be able to achieve the locality constraints in your scheduler.
>
>
>
> If your scheduler happens to be running on the JVM, our open source Fenzo
> scheduling library can be useful, or at least provide one idea on how you
> could write a scheduler that deals with such constraints. In Fenzo, for
> example, you'd write a custom plugin to handle the locality by using the
> external information I refer to above to "score" agents that fit
> your task better. Fenzo will then pick the best agent to launch your task
> for locality.
>
>
>
> One limitation is that you'd have little to no control over ensuring
> that the agents on which those other frameworks' tasks are running
> will have additional resources available to fit your tasks, or that
> offers from those agents will arrive at your scheduler. Some variation of
> "delay scheduling" can help with the latter by rejecting offers from agents
> that do not contain the tasks of interest from other frameworks.
>
>
>
>
>
> On Wed, Dec 14, 2016 at 10:33 AM, Petr Novak <oss.mli...@gmail.com> wrote:
>
> Thanks a lot for the input.
>
>
>
> “Y scheduler can accept a rule how to check readiness on startup”
>
>
>
> Based on that, it seems like a +1 that I can consider it a responsibility
> of a scheduler.
>
>
>
> Cheers,
>
> Petr
>
>
>
>
>
> *From:* Alex Rukletsov [mailto:a...@mesosphere.com]
> *Sent:* 14. prosince 2016 13:01
> *To:* user
> *Subject:* Re: Can I consider other framework tasks as a resource? Does
> it make sense?
>
>
>
> Task dependency is probably too vague to discuss specifically. Mesos
> currently does not explicitly support arbitrary task dependencies. You
> mentioned colocation, one type of dependency, so let's look at it.
>
>
>
> If I understood you correctly, you would like to colocate a task from
> framework B on the same node where a task from framework A is running. The
> first problem is to get a list of such nodes (and keep it updated,
> because tasks may crash, migrate, and so on). This can be done, say, by using
> Mesos DNS or the like. The second problem is to ensure that the framework
> gets enough resources from those nodes. A possible solution h

Re: Can I consider other framework tasks as a resource? Does it make sense?

2016-12-14 Thread Sharma Podila
In general, placing a task based on certain constraints (e.g., locality
with other tasks) is a scheduling concern. The complexity in your scenario
is that the constraint specification requires knowledge external to your
scheduler. If you are able to route that external information (on what and
where other frameworks' tasks are running) into your scheduler, then you
should be able to achieve the locality constraints in your scheduler.

If your scheduler happens to be running on the JVM, our open source Fenzo
scheduling library can be useful, or at least provide one idea on how you
could write a scheduler that deals with such constraints. In Fenzo, for
example, you'd write a custom plugin to handle the locality by using the
external information I refer to above to "score" agents that fit
your task better. Fenzo will then pick the best agent to launch your task
for locality.

One limitation is that you'd have little to no control over ensuring
that the agents on which those other frameworks' tasks are running will
have additional resources available to fit your tasks, or that offers from
those agents will arrive at your scheduler. Some variation of "delay
scheduling" can help with the latter by rejecting offers from agents that do
not contain the tasks of interest from other frameworks.
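A minimal sketch of that delay-scheduling variation, with assumed names (plain Java, not the Mesos scheduler API): decline offers from agents that lack the tasks of interest, up to a bound, then fall back to accepting any agent so the task is not starved.

```java
import java.util.Set;

// Sketch of a "delay scheduling" variation (assumed names, not the Mesos
// scheduler API): decline offers from agents that do not host the other
// frameworks' tasks we want to colocate with, up to a bounded number of
// declines, then fall back to accepting any agent so the task is not starved.
class DelayScheduler {
    private final Set<String> agentsWithTasksOfInterest;
    private final int maxDeclines;
    private int declines = 0;

    DelayScheduler(Set<String> agentsWithTasksOfInterest, int maxDeclines) {
        this.agentsWithTasksOfInterest = agentsWithTasksOfInterest;
        this.maxDeclines = maxDeclines;
    }

    // Returns true if an offer from this agent should be accepted.
    boolean shouldAccept(String agentHostname) {
        if (agentsWithTasksOfInterest.contains(agentHostname)) {
            return true;  // colocated with a task of interest: accept now
        }
        if (declines < maxDeclines) {
            declines++;   // hold out for a better-placed offer
            return false;
        }
        return true;      // waited long enough; accept anywhere
    }
}
```

The decline bound trades locality for launch latency; a real framework would decline via the driver with a short offer filter timeout rather than a simple counter.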


On Wed, Dec 14, 2016 at 10:33 AM, Petr Novak  wrote:

> Thanks a lot for the input.
>
>
>
> “Y scheduler can accept a rule how to check readiness on startup”
>
>
>
> Based on that, it seems like a +1 that I can consider it a responsibility
> of a scheduler.
>
>
>
> Cheers,
>
> Petr
>
>
>
>
>
> *From:* Alex Rukletsov [mailto:a...@mesosphere.com]
> *Sent:* 14. prosince 2016 13:01
> *To:* user
> *Subject:* Re: Can I consider other framework tasks as a resource? Does
> it make sense?
>
>
>
> Task dependency is probably too vague to discuss specifically. Mesos
> currently does not explicitly support arbitrary task dependencies. You
> mentioned colocation, one type of dependency, so let's look at it.
>
>
>
> If I understood you correctly, you would like to colocate a task from
> framework B on the same node where a task from framework A is running. The
> first problem is to get a list of such nodes (and keep it updated,
> because tasks may crash, migrate, and so on). This can be done, say, by using
> Mesos DNS or the like. The second problem is to ensure that the framework
> gets enough resources from those nodes. A possible solution here is to put
> both frameworks A and B into the same role and use dynamic reservations to
> ensure enough resources are set aside for both tasks. Disadvantages: you
> should know about all dependencies upfront, and the frameworks should be in
> the same role.
>
>
>
> Now the question is, why would you need to colocate workloads? I would say
> this is something you should avoid if possible, like any extra constraint
> that complicates the system. Probably the only 100% legitimate use case for
> colocation is data locality. Solving this particular problem seems easier
> than addressing arbitrary task dependencies.
>
>
>
> If all you try to achieve is making sure a specific service, represented by
> a framework X, is running and ready in the cluster, you can do that by
> running specific checks before starting a dependent framework Y or
> launching a new task in that framework. If your question is about whether Y
> should know about X and how to check readiness of X in the cluster,
> I'd say you'd better keep that abstracted: the Y scheduler can accept a rule
> for how to check readiness on startup.
>
>
>
> On Wed, Dec 14, 2016 at 5:14 AM, haosdent  wrote:
>
> Hi, @Petr.
>
>
>
> > Like if I want to run my task collocated with some other tasks on the
> same node I have to make this decision somewhere.
>
> Do you mean "POD" here?
>
>
>
> For my cases, if there are some dependencies between my tasks, I use
> database, message queue or zookeeper to implement my requirement.
>
>
>
> On Wed, Dec 14, 2016 at 3:09 AM, Petr Novak  wrote:
>
> Hello,
>
> I want to execute tasks which require some other tasks from other
> framework(s) to be already running. I’m thinking about where such
> logic/strategy/policy belongs in principle. I understand scheduling as a
> process to decide where to execute a task according to some resource
> availability, typically CPU, mem, net, hdd, etc.
>
>
>
> If my task requires other tasks to be running, could I generalize and
> consider that those tasks from other frameworks are a kind of required
> resource, and put this logic/strategy decision into the scheduler? Like if I
> want to run my task colocated with some other tasks on the same node, I have
> to make this decision somewhere.
>
>
>
> Does it make any sense? I’m asking because I have never thought about
> other frameworks/tasks as “resources” so that I could put them into the
> scheduler to satisfy my understanding of a scheduler. Or it rather belongs
> higher, like to a framework, or lower to an 

Re: mesos agent not recovering after ZK init failure

2016-07-15 Thread Sharma Podila
Vinod,

MESOS-5854 <https://issues.apache.org/jira/browse/MESOS-5854> created. Feel
free to change the priority appropriately.

Yes, the workaround I mentioned for disk size is based on resource
specification, so that works for now.


On Fri, Jul 15, 2016 at 11:48 AM, Andrew Leung <ale...@netflix.com> wrote:

> Hi Jie,
>
> Yes, that is how we are working around this issue. However, we wanted to
> see if others were hitting this issue as well. If others had a similar
> Mesos Slave on ZFS setup, it might be worth considering a disk space
> calculation approach that works more reliably with ZFS or at least calling
> out the need to specify the disk resource explicitly.
>
> Thanks for the help.
> Andrew
>
> On Jul 15, 2016, at 11:41 AM, Jie Yu <yujie@gmail.com> wrote:
>
> Can you hard code your disk size using --resources flag?
>
>
> On Fri, Jul 15, 2016 at 11:31 AM, Sharma Podila <spod...@netflix.com>
> wrote:
>
>> We had this issue happen again and were able to debug further. The cause
>> for agent not being able to restart is that one of the resources (disk)
>> changed its total size since the last restart. However, this error does not
>> show up in INFO/WARN/ERROR files. We saw it in stdout only when manually
>> restarting the agent. It would be good to have all messages going to
>> stdout/stderr show up in the logs. Is there a config setting for it that I
>> missed?
>>
>> The disk size total is changing sometimes on our agents. It is off by a
>> few bytes (seeing ~10 bytes difference out of, say, 600 GB). We use ZFS on
>> our agents to manage the disk partition. From my colleague, Andrew (copied
>> here):
>>
>> The current Mesos approach (i.e., `statvfs()` for total blocks and assume
>>> that never changes) won’t work reliably on ZFS
>>>
>>
>> Anyone else experience this? We can likely hack a workaround for this by
>> reporting the "whole GBs" of the disk so we are insensitive to small
>> changes in the total size. But, not sure if the changes can be larger due
>> to Andrew's point above.
>>
>>
>> On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <spod...@netflix.com>
>> wrote:
>>
>>> Sure, will do.
>>>
>>>
>>> On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <bmah...@apache.org>
>>> wrote:
>>>
>>>> Very surprising.. I don't have any ideas other than trying to replicate
>>>> the scenario in a test.
>>>>
>>>> Please do keep us posted if you encounter it again and gain more
>>>> information.
>>>>
>>>> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <spod...@netflix.com>
>>>> wrote:
>>>>
>>>>> MESOS-4795 created.
>>>>>
>>>>> I don't have the exit status. We haven't seen a repeat yet, will catch
>>>>> the exit status next time it happens.
>>>>>
>>>>> Yes, removing the metadata directory was the only way it was resolved.
>>>>> This happened on multiple hosts requiring the same resolution.
>>>>>
>>>>>
>>>>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmah...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Feel free to create one. I don't have enough information to know what
>>>>>> the issue is without doing some further investigation, but if the 
>>>>>> situation
>>>>>> you described is accurate, it seems like there are two strange bugs:
>>>>>>
>>>>>> -the silent exit (do you not have the exit status?), and
>>>>>> -the flapping from ZK errors that needed the meta data directory to
>>>>>> be removed to resolve (are you convinced the removal of the meta 
>>>>>> directory
>>>>>> is what solved it?)
>>>>>>
>>>>>> It would be good to track these issues in case they crop up again.
>>>>>>
>>>>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spod...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Hi Ben,
>>>>>>>
>>>>>>> Let me know if there is a new issue created for this, I would like
>>>>>>> to add myself to watch it.
>>>>>>> Thanks.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix

Re: mesos agent not recovering after ZK init failure

2016-07-15 Thread Sharma Podila
We had this issue happen again and were able to debug further. The cause
for the agent not being able to restart is that one of the resources (disk)
changed its total size since the last restart. However, this error does not
show up in INFO/WARN/ERROR files. We saw it in stdout only when manually
restarting the agent. It would be good to have all messages going to
stdout/stderr show up in the logs. Is there a config setting for it that I
missed?

The disk size total is changing sometimes on our agents. It is off by a few
bytes (a ~10-byte difference out of, say, 600 GB). We use ZFS on our
agents to manage the disk partition. From my colleague, Andrew (copied
here):

The current Mesos approach (i.e., `statvfs()` for total blocks and assume
> that never changes) won’t work reliably on ZFS
>

Anyone else experience this? We can likely hack a workaround for this by
reporting the "whole GBs" of the disk so we are insensitive to small
changes in the total size. But, not sure if the changes can be larger due
to Andrew's point above.
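A sketch of that "whole GBs" workaround (a hypothetical helper, not Mesos code): report the disk total rounded down to whole gigabytes so byte-level jitter does not register as a changed resource.

```java
// Sketch of the "whole GBs" workaround: report the disk total rounded down to
// whole gigabytes so byte-level jitter in statvfs() totals (as seen on ZFS)
// does not register as a changed resource across agent restarts. Caveat: if
// the jitter happens to cross a GB boundary, the reported value still changes.
class DiskRounding {
    static final long BYTES_PER_GB = 1024L * 1024L * 1024L;

    static long wholeGbs(long totalBytes) {
        return totalBytes / BYTES_PER_GB;
    }
}
```

This only insulates against small drift; a larger change (per Andrew's point) could still cross a gigabyte boundary and trip the restart check.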


On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <spod...@netflix.com> wrote:

> Sure, will do.
>
>
> On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
>> Very surprising.. I don't have any ideas other than trying to replicate
>> the scenario in a test.
>>
>> Please do keep us posted if you encounter it again and gain more
>> information.
>>
>> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <spod...@netflix.com>
>> wrote:
>>
>>> MESOS-4795 created.
>>>
>>> I don't have the exit status. We haven't seen a repeat yet, will catch
>>> the exit status next time it happens.
>>>
>>> Yes, removing the metadata directory was the only way it was resolved.
>>> This happened on multiple hosts requiring the same resolution.
>>>
>>>
>>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmah...@apache.org>
>>> wrote:
>>>
>>>> Feel free to create one. I don't have enough information to know what
>>>> the issue is without doing some further investigation, but if the situation
>>>> you described is accurate, it seems like there are two strange bugs:
>>>>
>>>> -the silent exit (do you not have the exit status?), and
>>>> -the flapping from ZK errors that needed the meta data directory to be
>>>> removed to resolve (are you convinced the removal of the meta directory is
>>>> what solved it?)
>>>>
>>>> It would be good to track these issues in case they crop up again.
>>>>
>>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spod...@netflix.com>
>>>> wrote:
>>>>
>>>>> Hi Ben,
>>>>>
>>>>> Let me know if there is a new issue created for this, I would like to
>>>>> add myself to watch it.
>>>>> Thanks.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix.com>
>>>>> wrote:
>>>>>
>>>>>> Hi Ben,
>>>>>>
>>>>>> That is accurate, with one additional line:
>>>>>>
>>>>>> -Agent running fine with 0.24.1
>>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>> -ZK issue resolved
>>>>>> -Most agents stop flapping and function correctly
>>>>>> -Some agents continue flapping, but silent exit after printing the
>>>>>> detector.cpp:481 log line.
>>>>>> -The agents that continue to flap repaired with manual removal of
>>>>>> contents in mesos-slave's working dir
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmah...@apache.org>
>>>>>> wrote:
>>>>>>
>>>>>>> Hey Sharma,
>>>>>>>
>>>>>>> I didn't quite follow the timeline of events here or how the agent
>>>>>>> logs you posted fit into the timeline of events. Here's how I 
>>>>>>> interpreted:
>>>>>>>
>>>>>>> -Agent running fine with 0.24.1
>>>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>>> -ZK issue resolved
>>>>>>> -Most agents stop flapping and function correctly
>>>>>>> -Some agents continue flapping, but silent ex

Re: Mesos on hybrid AWS - Best practices?

2016-06-30 Thread Sharma Podila
I would second the suggestion of separate Mesos clusters for DC and AWS,
with a layer on top for picking one or the other based on the job SLAs and
resource requirements.
The local storage on cloud instances is more ephemeral than I'd expect the
DC instances' to be, so persistent storage of job metadata needs
consideration. Using something like DynamoDB may work; however, depending
on the scale of your operations, you may have to plan for EC2 rate-limiting
its API calls and/or pay for higher IOPS for data storage/access.
Treating the cloud instances as immutable infrastructure has additional
benefits. For example, we deploy a new Mesos master ASG for version upgrades,
let it join the quorum, and then "tear down" the old master ASG. Same for
agents, although for agent migration our framework does coordinate the
migration of jobs from the old agent ASG to the new one, with some SLAs on
not too many instances of a service being down at a time. Sort of what the
maintenance primitives from Mesos aim to address.


On Thu, Jun 30, 2016 at 9:41 AM, Ken Sipe  wrote:

> I would suggest a cluster on AWS and a cluster on-prem. Then tooling on
> top to manage between the 2.
> It is unlikely that a failure of a task on-prem should have a scheduled
> replacement on AWS, or vice versa. It is likely that you will end up
> creating constraints to statically partition the clusters anyway, IMO.
> 2 clusters eliminates most of your proposed questions.
>
> ken
>
> > On Jun 30, 2016, at 10:57 AM, Florian Pfeiffer  wrote:
> >
> > Hi,
> >
> > the last 2 years I managed a Mesos cluster with bare metal on-premise.
> Now at my new company, the situation is a little bit different, and I'm
> wondering if there are some kinds of best practices:
> > The company is in the middle of a transition from on-premise to AWS. The
> old stuff is still running in the DC, the newer microservices are running
> within autoscaling groups on AWS, and other AWS services like DynamoDB,
> Kinesis and Lambda are also on the rise.
> >
> > So in my naive view of the world (where no problems occur. Never!)
> I'm thinking that it would be great to span a hybrid Mesos cluster over
> AWS to leverage the still-available resources in the DC, which get more
> and more underutilized over time.
> >
> > Now my naive world view slowly crumbles, and I realize that I'm missing
> experience with AWS. Questions that are already popping up (besides all
> those questions where I currently don't know that I will have them...) are:
> > * Is a Virtual Private Gateway to my VPC enough, or do I need to aim for
> a Direct Connect?
> > * Put everything into one account, or use a multi-account strategy?
> (Mainly to prevent things running amok and dragging stuff down while
> running into an account-wide shared limit?)
> > * Will e.g. DynamoDB be "fast" enough if it's accessed from the
> data center?
> >
> > I'll appreciate any feedback or lessons learned about that topic :)
> >
> > Thanks,
> > Florian
> >
>
>


Re: how to stop the mesos executor process in JVM?

2016-06-06 Thread Sharma Podila
Yao, in our Java executor, we explicitly call System.exit(0) after we have
successfully sent the last finished message. However, note that there can
be a bit of a timing issue here. Once we send the last message, we
call an asynchronous "sleep some and exit" routine. This gives the Mesos
driver a chance to send the last message successfully before the executor
JVM exits. Usually, a sleep of 2-3 seconds should suffice. There may be a
more elegant way to handle this timing issue, but I haven't looked at it
recently.
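A sketch of that asynchronous "sleep some and exit" routine. The exit action is injected as a Runnable here so the pattern can be shown without the Mesos driver; in a real executor it would be `() -> System.exit(0)`, scheduled 2-3 seconds after the final `driver.sendStatusUpdate(...)` call.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;

// Sketch of the asynchronous "sleep some and exit" routine. The exit action
// is injected so the pattern can be shown without the Mesos driver; in a real
// executor it would be () -> System.exit(0), scheduled a couple of seconds
// after the final TASK_FINISHED status update, giving the driver time to
// flush the last message before the JVM dies.
class DelayedExit {
    static ScheduledFuture<?> scheduleExit(Runnable exitAction, long delayMillis) {
        ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "delayed-exit");
                t.setDaemon(true);  // don't keep the JVM alive on our account
                return t;
            });
        return scheduler.schedule(exitAction, delayMillis, TimeUnit.MILLISECONDS);
    }
}
```

The key point is that the schedule call returns immediately, so the method that sent the final status update can itself return and let the driver finish its work.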


On Mon, Jun 6, 2016 at 6:34 AM, Vinod Kone  wrote:

> Couple things.
>
> You need to do the business logic and status update sending in a different
> thread, rather than synchronously in launchTask(). This is because the driver
> doesn't send messages to the agent until the launchTask() method returns.
> See
> https://github.com/apache/mesos/blob/master/src/examples/java/TestExecutor.java
> for example.
>
> Regarding exiting the executor, driver.stop() or driver.abort() only stops
> the driver, i.e., your executor won't be able to send or receive messages
> from the agent. It is entirely up to the executor process to exit.
>
> HTH,
>
>
>
> On Mon, Jun 6, 2016 at 4:05 AM, Yao Wang  wrote:
>
>> Hi, all!
>>
>> I wrote my own executor to run code.
>>
>> I override the launchTask method like this:
>>
>>
>> 
>>
>> @Override
>> public void launchTask(ExecutorDriver driver, Protos.TaskInfo task) {
>>     LOGGER.info("Executor is launching task#{}\n...", task);
>>
>>     // before launch
>>     driver.sendStatusUpdate(Protos.TaskStatus.newBuilder()
>>         .setTaskId(task.getTaskId())
>>         .setState(Protos.TaskState.TASK_RUNNING)
>>         .build());
>>
>>     LOGGER.info("Add your business code here ..");
>>     // business code here
>>
>>     // after launch
>>     driver.sendStatusUpdate(Protos.TaskStatus.newBuilder()
>>         .setTaskId(task.getTaskId())
>>         .setState(Protos.TaskState.TASK_FINISHED)
>>         .setData(ByteString.copyFromUtf8("${taksData}"))
>>         .build());
>> } // end method launchTask
>> 
>>
>>
>> And I build the CommandInfo like this:
>>
>> 
>>
>>
>> String executorCommand = String.format("java -jar %s",
>>     extractPath(executorJarPath));
>>
>> // executorJarURI is a local or Hadoop URI
>> Protos.CommandInfo.URI.Builder executorJarURI =
>>     Protos.CommandInfo.URI.newBuilder().setValue(executorJarPath);
>>
>> Protos.CommandInfo.Builder commandInfoBuilder =
>>     Protos.CommandInfo.newBuilder()
>>         .setEnvironment(envBuilder)
>>         .setValue(executorCommand)
>>         .addUris(executorJarURI);
>>
>> long ctms = System.nanoTime();
>>
>> Protos.ExecutorID.Builder executorIDBuilder =
>>     Protos.ExecutorID.newBuilder().setValue(new StringBuilder()
>>         .append(ctms).append("-")
>>         .append(task.getTaskRequestId()).toString());
>>
>> Protos.ExecutorInfo.Builder executorInfoBuilder =
>>     Protos.ExecutorInfo.newBuilder()
>>         .setExecutorId(executorIDBuilder)
>>         .setCommand(commandInfoBuilder)
>>         .setName("flexcloud-executor-2.0.1-" + ctms)
>>         .setSource("java");
>>
>> // TaskInfo
>> Protos.TaskInfo.Builder taskInfoBuilder =
>>     Protos.TaskInfo.newBuilder()
>>         .setName(task.getTaskName())
>>         .setTaskId(taskIDBuilder)
>>         .setSlaveId(offer.getSlaveId())
>>         .setExecutor(executorInfoBuilder);
>>
>> return taskInfoBuilder.build();
>> 
>>
>>
>> After running the executor with Mesos several times, I found that the
>> executor processes did not exit.
>>
>> Executing $ ps -ef | grep "java -jar" on the slave machine shows me:
>>
>> wangyao$ ps -ef | grep "java -jar"
>>   501 20078 19302   0  3:54下午 ?? 0:15.77 /usr/bin/java -jar
>> flexcloud-executor.jar
>>   501 20154 19302   0  3:54下午 ?? 0:17.92 /usr/bin/java -jar
>> flexcloud-executor.jar
>>   501 20230 19302   0  3:54下午 ?? 0:16.13 /usr/bin/java -jar
>> flexcloud-executor.jar
>>
>> In order to stop these processes after running an executor, I first tried
>> adding "driver.stop()" or "driver.abort()" to the executor's launchTask
>> method, but it had no effect.
>> So I added "System.exit(0)" to stop the JVM directly... and it works.
>>
>> I have doubts about this way of stopping the executor. Is it the only way
>> to do that?
>>
>>
>


Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-05-13 Thread Sharma Podila
Mesos builds were mostly already covered by notes from the community. There
were a few other items that included the challenges of running on a corporate
network in a company that runs everything on the EC2 cloud, the time to burn
32 GB SD cards, etc. Will have more details later.
Thanks.

Sent from my iPhone

> On May 13, 2016, at 2:10 AM, Tomek Janiszewski <jani...@gmail.com> wrote:
> 
> Cool. Did you hit any troubles with that setup?
> 
> 
> pt., 13.05.2016, 03:13 użytkownik Sharma Podila <spod...@netflix.com> napisał:
>> We have Mesos agents running on Pi3's taking tasks from master running on a 
>> Linux laptop. 
>> 
>> https://twitter.com/aspyker/status/730924571440779264
>> 
>> More info to follow.
>> 
>> Thanks for all the pointers.
>> 
>> 
>>> On Fri, Apr 29, 2016 at 1:09 PM, Sharma Podila <spod...@netflix.com> wrote:
>>> Fyi- Things are progressing, we have a build on Pi. The agent was able to 
>>> come up and register with a master running on a regular Linux server. 
>>> 
>>> https://twitter.com/aspyker/status/725923864031559681
>>> 
>>> The master has problems running with this build on the Pi, but that isn't a 
>>> goal for us. We are running Mesos 0.24.1 for now. We'll document our build 
>>> steps, etc. here soon.
>>> 
>>> 
>>> 
>>>> On Mon, Apr 25, 2016 at 10:21 AM, Sharma Podila <spod...@netflix.com> 
>>>> wrote:
>>>> This is for an internal hackday project, not for a production setup. 
>>>> 
>>>> 
>>>>> On Mon, Apr 25, 2016 at 1:05 AM, Aaron Carey <aca...@ilm.com> wrote:
>>>>> Out of curiosity... is this for fun or production workloads? I'd be 
>>>>> curious to hear about raspis being used in production!
>>>>> 
>>>>>  --
>>>>> 
>>>>> Aaron Carey
>>>>> Production Engineer - Cloud Pipeline
>>>>> Industrial Light & Magic
>>>>> London
>>>>> 020 3751 9150
>>>>> From: Sharma Podila [spod...@netflix.com]
>>>>> Sent: 22 April 2016 17:53
>>>>> To: user@mesos.apache.org; dev
>>>>> Subject: Running Mesos agent on ARM (Raspberry Pi)?
>>>>> 
>>>>> We are working on a hack to run Mesos agents on Raspberry Pi and are 
>>>>> wondering if anyone here has done that before. From the Google search 
>>>>> results we looked at so far, it seems like it has been compiled, but we 
>>>>> haven't seen an indication that anyone has run it and launched tasks on 
>>>>> them. And does it sound right that it might take 4 hours or so to compile?
>>>>> 
>>>>> We are looking to run just the agents. The master will be on a regular 
>>>>> Ubuntu laptop or a server. 
>>>>> 
>>>>> Appreciate any pointers.


Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-05-12 Thread Sharma Podila
We have Mesos agents running on Pi3's taking tasks from master running on a
Linux laptop.

https://twitter.com/aspyker/status/730924571440779264

More info to follow.

Thanks for all the pointers.


On Fri, Apr 29, 2016 at 1:09 PM, Sharma Podila <spod...@netflix.com> wrote:

> Fyi- Things are progressing, we have a build on Pi. The agent was able to
> come up and register with a master running on a regular Linux server.
>
> https://twitter.com/aspyker/status/725923864031559681
>
> The master has problems running with this build on the Pi, but that isn't
> a goal for us. We are running Mesos 0.24.1 for now. We'll document our
> build steps, etc. here soon.
>
>
>
> On Mon, Apr 25, 2016 at 10:21 AM, Sharma Podila <spod...@netflix.com>
> wrote:
>
>> This is for an internal hackday project, not for a production setup.
>>
>>
>> On Mon, Apr 25, 2016 at 1:05 AM, Aaron Carey <aca...@ilm.com> wrote:
>>
>>> Out of curiosity... is this for fun or production workloads? I'd be
>>> curious to hear about raspis being used in production!
>>>
>>> --
>>>
>>> Aaron Carey
>>> Production Engineer - Cloud Pipeline
>>> Industrial Light & Magic
>>> London
>>> 020 3751 9150
>>>
>>> --
>>> *From:* Sharma Podila [spod...@netflix.com]
>>> *Sent:* 22 April 2016 17:53
>>> *To:* user@mesos.apache.org; dev
>>> *Subject:* Running Mesos agent on ARM (Raspberry Pi)?
>>>
>>> We are working on a hack to run Mesos agents on Raspberry Pi and are
>>> wondering if anyone here has done that before. From the Google search
>>> results we looked at so far, it seems like it has been compiled, but we
>>> haven't seen an indication that anyone has run it and launched tasks on
>>> them. And does it sound right that it might take 4 hours or so to compile?
>>>
>>> We are looking to run just the agents. The master will be on a regular
>>> Ubuntu laptop or a server.
>>>
>>> Appreciate any pointers.
>>>
>>>
>>>
>>
>


Re: How to use a complete host

2016-05-02 Thread Sharma Podila
This can't be achieved with the offer model as it stands today, unless you
have only a single framework in the cluster. There is no visibility into
what other resources are available on the agent that weren't offered to
your framework.

However, for the short term, you can use a hack: put in custom attributes
that indicate the total amount of resources available on the agent. Then,
when you get an offer, you can verify whether you got an offer for the
entire agent's resources.

MESOS-4138 is interesting, as Jeff pointed out.
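A sketch of that verification step, assuming each agent is started with attributes advertising its totals (the attribute names and this helper are made up for illustration, not anything Mesos defines):

```java
import java.util.Map;

// Sketch of the custom-attribute hack. Assume each agent is started with
// attributes advertising its totals, e.g. something like
// --attributes="total_cpus:32;total_mem:65536" -- these attribute names are
// made up for illustration, not anything Mesos defines. On receiving an
// offer, compare the offered amounts against the advertised totals to decide
// whether the entire agent was offered.
class WholeAgentCheck {
    static boolean isWholeAgentOffer(Map<String, Double> offered,
                                     Map<String, Double> advertisedTotals) {
        for (Map.Entry<String, Double> e : advertisedTotals.entrySet()) {
            Double got = offered.get(e.getKey());
            if (got == null || got < e.getValue()) {
                return false;  // part of this resource is held elsewhere
            }
        }
        return true;
    }
}
```

If the check fails, the framework would decline the offer and wait for one covering the full machine.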





On Mon, May 2, 2016 at 11:07 AM, haosdent  wrote:

> It sounds like you could reserve resources for a role on that machine, and
> then have your framework launched under that role.
>
> On Tue, May 3, 2016 at 1:57 AM, Christoph Heer 
> wrote:
>
>> Hi everyone,
>>
>> sometimes in my Mesos use case it's required to ensure that my own
>> framework is able to schedule a task which consumes all resources of a
>> machine.
>>
>> Do you have any advice on how to implement such a scheduler? Is there
>> another scheduler which has already implemented something similar?
>>
>> Thank you and best regards
>> Christoph
>>
>
>
>
> --
> Best Regards,
> Haosdent Huang
>


Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-25 Thread Sharma Podila
This is for an internal hackday project, not for a production setup.


On Mon, Apr 25, 2016 at 1:05 AM, Aaron Carey <aca...@ilm.com> wrote:

> Out of curiosity... is this for fun or production workloads? I'd be
> curious to hear about raspis being used in production!
>
> --
>
> Aaron Carey
> Production Engineer - Cloud Pipeline
> Industrial Light & Magic
> London
> 020 3751 9150
>
> ------
> *From:* Sharma Podila [spod...@netflix.com]
> *Sent:* 22 April 2016 17:53
> *To:* user@mesos.apache.org; dev
> *Subject:* Running Mesos agent on ARM (Raspberry Pi)?
>
> We are working on a hack to run Mesos agents on Raspberry Pi and are
> wondering if anyone here has done that before. From the Google search
> results we looked at so far, it seems like it has been compiled, but we
> haven't seen an indication that anyone has run it and launched tasks on
> them. And does it sound right that it might take 4 hours or so to compile?
>
> We are looking to run just the agents. The master will be on a regular
> Ubuntu laptop or a server.
>
> Appreciate any pointers.
>
>
>


Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-22 Thread Sharma Podila
Appreciate all the pointers so far. We'll certainly share what we end up
with in a few weeks.


On Fri, Apr 22, 2016 at 5:49 PM, tommy xiao <xia...@gmail.com> wrote:

> the alternative way, use Docker on rpi to containerised the mesos master
> and slave, it also cool things.
>
> 2016-04-23 1:38 GMT+08:00 Dario Rexin <dario.re...@me.com>:
>
>> Hi Sharma,
>>
>> I played around with Mesos on RPi a while back and have been able to
>> compile and run it with 2 little patches.
>>
>> 1) Depending on the ZK version, it may be necessary to patch a function
>> that uses inline ASM to use the resp. compiler intrinsics (I don’t remember
>> where exactly in zk it was, but the compile error should tell you)
>>
>> 2) There is string formatting code somewhere that compiles, but is not
>> architecture independent, i.e. behaves different on 32 and 64 bit. IIRC the
>> fix was to change %lu to %llu or something close to that. The stack trace
>> when Mesos crashes should tell you. If you’re lucky enough to have an RPi3,
>> this may not be necessary.
>>
>> Also, if you compile on the RPi make sure to create a swap file of
>> >=512MB. The build process will use lots of memory. I have not been able to
>> compile on multiple cores, because the memory usage was just too high.
>>
>> I hope this helps.
>>
>> On Apr 22, 2016, at 10:23 AM, Tomek Janiszewski <jani...@gmail.com>
>> wrote:
>>
>> As @haosdent mentioned with Kevin we tried to run it on ARM. AFAIR there
>> was a problem only with master, agents runs smoothly (or pretend to). To
>> run it on RPi you need to compile it for ARM. Easy but long solution is to
>> compile it on rpi. Quick but a little bit harder  cross compile it on
>> "normal" machine and upload to device.
>>
>>
>> http://likemagicappears.com/projects/raspberry-pi-cluster/mesos-on-raspbian/
>>
>> On Fri, 22 Apr 2016 at 19:02, haosdent <haosd...@gmail.com> wrote:
>>
>>> Tomek have a gsoc proposal to make Mesos build on ARM
>>> https://docs.google.com/document/d/1zbms2jQfExuIm6g-adqaXjFpPif6OsqJ84KAgMrOjHQ/edit
>>> I think you could take a look at this code in github
>>> https://github.com/lyda/mesos-on-arm
>>>
>>> On Sat, Apr 23, 2016 at 12:53 AM, Sharma Podila <spod...@netflix.com>
>>> wrote:
>>>
>>>> We are working on a hack to run Mesos agents on Raspberry Pi and are
>>>> wondering if anyone here has done that before. From the Google search
>>>> results we looked at so far, it seems like it has been compiled, but we
>>>> haven't seen an indication that anyone has run it and launched tasks on
>>>> them. And does it sound right that it might take 4 hours or so to compile?
>>>>
>>>> We are looking to run just the agents. The master will be on a regular
>>>> Ubuntu laptop or a server.
>>>>
>>>> Appreciate any pointers.
>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> Best Regards,
>>> Haosdent Huang
>>>
>>
>>
>
>
> --
> Deshi Xiao
> Twitter: xds2000
> E-mail: xiaods(AT)gmail.com
>


Running Mesos agent on ARM (Raspberry Pi)?

2016-04-22 Thread Sharma Podila
We are working on a hack to run Mesos agents on Raspberry Pi and are
wondering if anyone here has done that before. From the Google search
results we looked at so far, it seems like it has been compiled, but we
haven't seen an indication that anyone has run it and launched tasks on
them. And does it sound right that it might take 4 hours or so to compile?

We are looking to run just the agents. The master will be on a regular
Ubuntu laptop or a server.

Appreciate any pointers.


Re: mesos agent not recovering after ZK init failure

2016-03-07 Thread Sharma Podila
Sure, will do.


On Mon, Mar 7, 2016 at 5:54 PM, Benjamin Mahler <bmah...@apache.org> wrote:

> Very surprising.. I don't have any ideas other than trying to replicate
> the scenario in a test.
>
> Please do keep us posted if you encounter it again and gain more
> information.
>
> On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <spod...@netflix.com>
> wrote:
>
>> MESOS-4795 created.
>>
>> I don't have the exit status. We haven't seen a repeat yet, will catch
>> the exit status next time it happens.
>>
>> Yes, removing the metadata directory was the only way it was resolved.
>> This happened on multiple hosts requiring the same resolution.
>>
>>
>> On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmah...@apache.org>
>> wrote:
>>
>>> Feel free to create one. I don't have enough information to know what
>>> the issue is without doing some further investigation, but if the situation
>>> you described is accurate it seems like a there are two strange bugs:
>>>
>>> -the silent exit (do you not have the exit status?), and
>>> -the flapping from ZK errors that needed the meta data directory to be
>>> removed to resolve (are you convinced the removal of the meta directory is
>>> what solved it?)
>>>
>>> It would be good to track these issues in case they crop up again.
>>>
>>> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spod...@netflix.com>
>>> wrote:
>>>
>>>> Hi Ben,
>>>>
>>>> Let me know if there is a new issue created for this, I would like to
>>>> add myself to watch it.
>>>> Thanks.
>>>>
>>>>
>>>>
>>>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix.com>
>>>> wrote:
>>>>
>>>>> Hi Ben,
>>>>>
>>>>> That is accurate, with one additional line:
>>>>>
>>>>> -Agent running fine with 0.24.1
>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>> -ZK issue resolved
>>>>> -Most agents stop flapping and function correctly
>>>>> -Some agents continue flapping, but silent exit after printing the
>>>>> detector.cpp:481 log line.
>>>>> -The agents that continue to flap repaired with manual removal of
>>>>> contents in mesos-slave's working dir
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmah...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> Hey Sharma,
>>>>>>
>>>>>> I didn't quite follow the timeline of events here or how the agent
>>>>>> logs you posted fit into the timeline of events. Here's how I 
>>>>>> interpreted:
>>>>>>
>>>>>> -Agent running fine with 0.24.1
>>>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>>>> -ZK issue resolved
>>>>>> -Most agents stop flapping and function correctly
>>>>>> -Some agents continue flapping, but silent exit after printing the
>>>>>> detector.cpp:481 log line.
>>>>>>
>>>>>> Is this accurate? What is the exit code from the silent exit?
>>>>>>
>>>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spod...@netflix.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Maybe related, but, maybe different since a new process seems to
>>>>>>> find the master leader and still aborts, never recovering with restarts
>>>>>>> until work dir data is removed.
>>>>>>> It is happening in 0.24.1.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodk...@apache.org>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I
>>>>>>>> guess you are saying it is somehow related but not exactly the same 
>>>>>>>> issue?
>>>>>>>>
>>>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>>>>> r...@itevenworks.net> wrote:
>>>>>>>>
>>>>>>>>> On 9 February 

Re: mesos agent not recovering after ZK init failure

2016-02-26 Thread Sharma Podila
MESOS-4795 created.

I don't have the exit status. We haven't seen a repeat yet, will catch the
exit status next time it happens.

Yes, removing the metadata directory was the only way it was resolved. This
happened on multiple hosts requiring the same resolution.


On Thu, Feb 25, 2016 at 6:37 PM, Benjamin Mahler <bmah...@apache.org> wrote:

> Feel free to create one. I don't have enough information to know what the
> issue is without doing some further investigation, but if the situation you
> described is accurate it seems like a there are two strange bugs:
>
> -the silent exit (do you not have the exit status?), and
> -the flapping from ZK errors that needed the meta data directory to be
> removed to resolve (are you convinced the removal of the meta directory is
> what solved it?)
>
> It would be good to track these issues in case they crop up again.
>
> On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spod...@netflix.com>
> wrote:
>
>> Hi Ben,
>>
>> Let me know if there is a new issue created for this, I would like to add
>> myself to watch it.
>> Thanks.
>>
>>
>>
>> On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix.com>
>> wrote:
>>
>>> Hi Ben,
>>>
>>> That is accurate, with one additional line:
>>>
>>> -Agent running fine with 0.24.1
>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>> -ZK issue resolved
>>> -Most agents stop flapping and function correctly
>>> -Some agents continue flapping, but silent exit after printing the
>>> detector.cpp:481 log line.
>>> -The agents that continue to flap repaired with manual removal of
>>> contents in mesos-slave's working dir
>>>
>>>
>>>
>>> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmah...@apache.org>
>>> wrote:
>>>
>>>> Hey Sharma,
>>>>
>>>> I didn't quite follow the timeline of events here or how the agent logs
>>>> you posted fit into the timeline of events. Here's how I interpreted:
>>>>
>>>> -Agent running fine with 0.24.1
>>>> -Transient ZK issues, slave flapping with zookeeper_init failure
>>>> -ZK issue resolved
>>>> -Most agents stop flapping and function correctly
>>>> -Some agents continue flapping, but silent exit after printing the
>>>> detector.cpp:481 log line.
>>>>
>>>> Is this accurate? What is the exit code from the silent exit?
>>>>
>>>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spod...@netflix.com>
>>>> wrote:
>>>>
>>>>> Maybe related, but, maybe different since a new process seems to find
>>>>> the master leader and still aborts, never recovering with restarts until
>>>>> work dir data is removed.
>>>>> It is happening in 0.24.1.
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodk...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess
>>>>>> you are saying it is somehow related but not exactly the same issue?
>>>>>>
>>>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>>>> r...@itevenworks.net> wrote:
>>>>>>
>>>>>>> On 9 February 2016 at 11:04, Sharma Podila <spod...@netflix.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> We had a few mesos agents stuck in an unrecoverable state after a
>>>>>>>> transient ZK init error. Is this a known problem? I wasn't able to 
>>>>>>>> find an
>>>>>>>> existing jira item for this. We are on 0.24.1 at this time.
>>>>>>>>
>>>>>>>> Most agents were fine, except a handful. These handful of agents
>>>>>>>> had their mesos-slave process constantly restarting. The .INFO logfile 
>>>>>>>> had
>>>>>>>> the following contents below, before the process exited, with no error
>>>>>>>> messages. The restarts were happening constantly due to an existing 
>>>>>>>> service
>>>>>>>> keep alive strategy.
>>>>>>>>
>>>>>>>> To fix it, we manually stopped the service, removed the data in the
>

Re: mesos agent not recovering after ZK init failure

2016-02-23 Thread Sharma Podila
Hi Ben,

Let me know if there is a new issue created for this, I would like to add
myself to watch it.
Thanks.



On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix.com> wrote:

> Hi Ben,
>
> That is accurate, with one additional line:
>
> -Agent running fine with 0.24.1
> -Transient ZK issues, slave flapping with zookeeper_init failure
> -ZK issue resolved
> -Most agents stop flapping and function correctly
> -Some agents continue flapping, but silent exit after printing the
> detector.cpp:481 log line.
> -The agents that continue to flap repaired with manual removal of contents
> in mesos-slave's working dir
>
>
>
> On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmah...@apache.org>
> wrote:
>
>> Hey Sharma,
>>
>> I didn't quite follow the timeline of events here or how the agent logs
>> you posted fit into the timeline of events. Here's how I interpreted:
>>
>> -Agent running fine with 0.24.1
>> -Transient ZK issues, slave flapping with zookeeper_init failure
>> -ZK issue resolved
>> -Most agents stop flapping and function correctly
>> -Some agents continue flapping, but silent exit after printing the
>> detector.cpp:481 log line.
>>
>> Is this accurate? What is the exit code from the silent exit?
>>
>> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spod...@netflix.com>
>> wrote:
>>
>>> Maybe related, but, maybe different since a new process seems to find
>>> the master leader and still aborts, never recovering with restarts until
>>> work dir data is removed.
>>> It is happening in 0.24.1.
>>>
>>>
>>>
>>>
>>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodk...@apache.org>
>>> wrote:
>>>
>>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess
>>>> you are saying it is somehow related but not exactly the same issue?
>>>>
>>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>>> r...@itevenworks.net> wrote:
>>>>
>>>>> On 9 February 2016 at 11:04, Sharma Podila <spod...@netflix.com>
>>>>> wrote:
>>>>>
>>>>>> We had a few mesos agents stuck in an unrecoverable state after a
>>>>>> transient ZK init error. Is this a known problem? I wasn't able to find 
>>>>>> an
>>>>>> existing jira item for this. We are on 0.24.1 at this time.
>>>>>>
>>>>>> Most agents were fine, except a handful. These handful of agents had
>>>>>> their mesos-slave process constantly restarting. The .INFO logfile had 
>>>>>> the
>>>>>> following contents below, before the process exited, with no error
>>>>>> messages. The restarts were happening constantly due to an existing 
>>>>>> service
>>>>>> keep alive strategy.
>>>>>>
>>>>>> To fix it, we manually stopped the service, removed the data in the
>>>>>> working dir, and then restarted it. The mesos-slave process was able to
>>>>>> restart then. The manual intervention needed to resolve it is 
>>>>>> problematic.
>>>>>>
>>>>>> Here's the contents of the various log files on the agent:
>>>>>>
>>>>>> The .INFO logfile for one of the restarts before mesos-slave process
>>>>>> exited with no other error messages:
>>>>>>
>>>>>> Log file created at: 2016/02/09 02:12:48
>>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging
>>>>>> started!
>>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07
>>>>>> by builds
>>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation:
>>>>>> posix/cpu,posix/mem,filesystem/posix
>>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
>>>>>> 10.138.146.230:7101
>>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
>>>>>> --appc_store_dir="/tmp/mesos/store/appc"
>>>>>>

Re: AW: Feature request: move in-flight containers w/o stopping them

2016-02-19 Thread Sharma Podila
Moving stateless services can be trivial, or a non-problem, as others have
suggested.
Migrating stateful services becomes a function of migrating the state,
including any network connections, etc. To think aloud, from some past
experience with HPC-like systems: approaches have ranged from relying on
the underlying platform to support migration (vMotion, etc.), to
third-party libraries (Meiosys, I believe) that could work on existing
application binaries, to libraries (BLCR) that need support from the
application developer. I was involved with providing support for BLCR
based applications. One of the challenges was the time it took to
checkpoint an application with a large memory footprint, say, 100 GB or
more, which isn't uncommon in HPC. Incremental checkpointing wasn't an
option, at least at that point.
Regardless, Mesos' support for checkpoint-restore would have to consider
the type of checkpoint-restore system being used. I would imagine that the
core part of the solution would be simple-ish: providing a "workflow" for
the checkpoint-restore system (send a signal to start the checkpoint, then
wait a certain time for it to complete or time out). Relatively less simple
would be the actual integration of the checkpoint-restore system and
dealing with its constraints and idiosyncrasies.


On Fri, Feb 19, 2016 at 4:50 AM, Dick Davies  wrote:

> Agreed, vMotion always struck me as something for those monolithic
> apps with a lot of local state.
>
> The industry seems to be moving away from that as fast as its little
> legs will carry it.
>
> On 19 February 2016 at 11:35, Jason Giedymin 
> wrote:
> > Food for thought:
> >
> > One should refrain from monolithic apps. If they're small and stateless
> you
> > should be doing rolling upgrades.
> >
> > If you find yourself with one container and you can't easily distribute
> that
> > work load by just scaling and load balancing then you have a monolith.
> Time
> > to enhance it.
> >
> > Containers should not be treated like VMs.
> >
> > -Jason
> >
> > On Feb 19, 2016, at 6:05 AM, Mike Michel  wrote:
> >
> > Question is if you really need this when you are moving in the world of
> > containers/microservices where it is about building stateless 12factor
> apps
> > except databases. Why moving a service when you can just kill it and let
> the
> > work be done by 10 other containers doing the same? I remember a talk on
> > dockercon about containers and live migration. It was like: „And now
> where
> > you know how to do it, dont’t do it!“
> >
> >
> >
> > From: Avinash Sridharan [mailto:avin...@mesosphere.io]
> > Sent: Friday, 19 February 2016 05:48
> > To: user@mesos.apache.org
> > Subject: Re: Feature request: move in-flight containers w/o stopping them
> >
> >
> >
> > One problem with implementing something like vMotion for Mesos is to
> address
> > seamless movement of network connectivity as well. This effectively
> requires
> > moving the IP address of the container across hosts. If the container
> shares
> > host network stack, this won't be possible since this would imply moving
> the
> > host IP address from one host to another. When a container has its
> network
> > namespace, attached to the host, using a bridge, moving across L2
> segments
> > might be a possibility. To move across L3 segments you will need some
> form
> > of overlay (VxLAN maybe ?) .
> >
> >
> >
> > On Thu, Feb 18, 2016 at 7:34 PM, Jay Taylor  wrote:
> >
> > Is this theoretically feasible with Linux checkpoint and restore, perhaps
> > via CRIU? http://criu.org/Main_Page
> >
> >
> > On Feb 18, 2016, at 4:35 AM, Paul Bell  wrote:
> >
> > Hello All,
> >
> >
> >
> > Has there ever been any consideration of the ability to move in-flight
> > containers from one Mesos host node to another?
> >
> >
> >
> > I see this as analogous to VMware's "vMotion" facility wherein VMs can be
> > moved from one ESXi host to another.
> >
> >
> >
> > I suppose something like this could be useful from a load-balancing
> > perspective.
> >
> >
> >
> > Just curious if it's ever been considered and if so - and rejected - why
> > rejected?
> >
> >
> >
> > Thanks.
> >
> >
> >
> > -Paul
> >
> >
> >
> >
> >
> >
> >
> >
> >
> > --
> >
> > Avinash Sridharan, Mesosphere
> >
> > +1 (323) 702 5245
>


Re: mesos agent not recovering after ZK init failure

2016-02-10 Thread Sharma Podila
Hi Ben,

That is accurate, with one additional line:

-Agent running fine with 0.24.1
-Transient ZK issues, slave flapping with zookeeper_init failure
-ZK issue resolved
-Most agents stop flapping and function correctly
-Some agents continue flapping, but silent exit after printing the
detector.cpp:481 log line.
-The agents that continue to flap repaired with manual removal of contents
in mesos-slave's working dir



On Wed, Feb 10, 2016 at 9:43 AM, Benjamin Mahler <bmah...@apache.org> wrote:

> Hey Sharma,
>
> I didn't quite follow the timeline of events here or how the agent logs
> you posted fit into the timeline of events. Here's how I interpreted:
>
> -Agent running fine with 0.24.1
> -Transient ZK issues, slave flapping with zookeeper_init failure
> -ZK issue resolved
> -Most agents stop flapping and function correctly
> -Some agents continue flapping, but silent exit after printing the
> detector.cpp:481 log line.
>
> Is this accurate? What is the exit code from the silent exit?
>
> On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spod...@netflix.com> wrote:
>
>> Maybe related, but, maybe different since a new process seems to find the
>> master leader and still aborts, never recovering with restarts until work
>> dir data is removed.
>> It is happening in 0.24.1.
>>
>>
>>
>>
>> On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodk...@apache.org> wrote:
>>
>>> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess
>>> you are saying it is somehow related but not exactly the same issue?
>>>
>>> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
>>> r...@itevenworks.net> wrote:
>>>
>>>> On 9 February 2016 at 11:04, Sharma Podila <spod...@netflix.com> wrote:
>>>>
>>>>> We had a few mesos agents stuck in an unrecoverable state after a
>>>>> transient ZK init error. Is this a known problem? I wasn't able to find an
>>>>> existing jira item for this. We are on 0.24.1 at this time.
>>>>>
>>>>> Most agents were fine, except a handful. These handful of agents had
>>>>> their mesos-slave process constantly restarting. The .INFO logfile had the
>>>>> following contents below, before the process exited, with no error
>>>>> messages. The restarts were happening constantly due to an existing 
>>>>> service
>>>>> keep alive strategy.
>>>>>
>>>>> To fix it, we manually stopped the service, removed the data in the
>>>>> working dir, and then restarted it. The mesos-slave process was able to
>>>>> restart then. The manual intervention needed to resolve it is problematic.
>>>>>
>>>>> Here's the contents of the various log files on the agent:
>>>>>
>>>>> The .INFO logfile for one of the restarts before mesos-slave process
>>>>> exited with no other error messages:
>>>>>
>>>>> Log file created at: 2016/02/09 02:12:48
>>>>> Running on machine: titusagent-main-i-7697a9c5
>>>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging
>>>>> started!
>>>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07
>>>>> by builds
>>>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation:
>>>>> posix/cpu,posix/mem,filesystem/posix
>>>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
>>>>> 10.138.146.230:7101
>>>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
>>>>> --appc_store_dir="/tmp/mesos/store/appc"
>>>>> --attributes="region:us-east-1;" --authenticatee=""
>>>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>>>> --container_disk_watch_interval="15secs" --containerizers="mesos" "
>>>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources:
>>>>> ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>>>> I0209 02:12:48.512320 97296 slav

mesos agent not recovering after ZK init failure

2016-02-09 Thread Sharma Podila
We had a few mesos agents stuck in an unrecoverable state after a transient
ZK init error. Is this a known problem? I wasn't able to find an existing
jira item for this. We are on 0.24.1 at this time.

Most agents were fine, except a handful. This handful of agents had their
mesos-slave process constantly restarting. The .INFO logfile had the
contents below before the process exited, with no error messages. The
restarts were happening constantly due to an existing service keep-alive
strategy.

To fix it, we manually stopped the service, removed the data in the working
dir, and then restarted it. The mesos-slave process was able to restart
then. The manual intervention needed to resolve it is problematic.

Here's the contents of the various log files on the agent:

The .INFO logfile for one of the restarts before mesos-slave process exited
with no other error messages:

Log file created at: 2016/02/09 02:12:48
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by
builds
I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation:
posix/cpu,posix/mem,filesystem/posix
I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
10.138.146.230:7101
I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
--appc_store_dir="/tmp/mesos/store/appc"
--attributes="region:us-east-1;" --authenticatee=""
--cgroups_cpu_enable_pids_and_tids_count="false"
--cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
--cgroups_limit_swap="false" --cgroups_root="mesos"
--container_disk_watch_interval="15secs" --containerizers="mesos" "
I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources:
ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: 
I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@
10.138.146.230:7101) connected to ZooKeeper
I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations: queue
size (joins, cancels, datas) = (0, 0, 0)
I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path
'/titus/main/mesos' in ZooKeeper
I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader:
(id='209')
I0209 02:12:48.520803 97284 group.cpp:674] Trying to get
'/titus/main/mesos/json.info_000209' in ZooKeeper
I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from
'/mnt/data/mesos/meta'
I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file
'/mnt/data/mesos/meta/resources/resources.info'
I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=
master@10.230.95.110:7103) is detected


The .FATAL log file when the original transient ZK error occurred:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper,
zookeeper_init: No such file or directory [2]


The .ERROR log file:

Log file created at: 2016/02/05 17:21:37
Running on machine: titusagent-main-i-7697a9c5
Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create ZooKeeper,
zookeeper_init: No such file or directory [2]

The .WARNING file had the same content.


Re: mesos agent not recovering after ZK init failure

2016-02-09 Thread Sharma Podila
Maybe related, but maybe different, since a new process seems to find the
master leader and still aborts, never recovering across restarts until the
work dir data is removed.
It is happening in 0.24.1.




On Tue, Feb 9, 2016 at 11:53 AM, Vinod Kone <vinodk...@apache.org> wrote:

> MESOS-1326 was fixed in 0.19.0 (set the fix version now). But I guess you
> are saying it is somehow related but not exactly the same issue?
>
> On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés <
> r...@itevenworks.net> wrote:
>
>> On 9 February 2016 at 11:04, Sharma Podila <spod...@netflix.com> wrote:
>>
>>> We had a few mesos agents stuck in an unrecoverable state after a
>>> transient ZK init error. Is this a known problem? I wasn't able to find an
>>> existing jira item for this. We are on 0.24.1 at this time.
>>>
>>> Most agents were fine, except a handful. These handful of agents had
>>> their mesos-slave process constantly restarting. The .INFO logfile had the
>>> following contents below, before the process exited, with no error
>>> messages. The restarts were happening constantly due to an existing service
>>> keep alive strategy.
>>>
>>> To fix it, we manually stopped the service, removed the data in the
>>> working dir, and then restarted it. The mesos-slave process was able to
>>> restart then. The manual intervention needed to resolve it is problematic.
>>>
>>> Here's the contents of the various log files on the agent:
>>>
>>> The .INFO logfile for one of the restarts before mesos-slave process
>>> exited with no other error messages:
>>>
>>> Log file created at: 2016/02/09 02:12:48
>>> Running on machine: titusagent-main-i-7697a9c5
>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>> I0209 02:12:48.502403 97255 logging.cpp:172] INFO level logging started!
>>> I0209 02:12:48.502938 97255 main.cpp:185] Build: 2015-09-30 16:12:07 by
>>> builds
>>> I0209 02:12:48.502974 97255 main.cpp:187] Version: 0.24.1
>>> I0209 02:12:48.503288 97255 containerizer.cpp:143] Using isolation:
>>> posix/cpu,posix/mem,filesystem/posix
>>> I0209 02:12:48.507961 97255 main.cpp:272] Starting Mesos slave
>>> I0209 02:12:48.509827 97296 slave.cpp:190] Slave started on 1)@
>>> 10.138.146.230:7101
>>> I0209 02:12:48.510074 97296 slave.cpp:191] Flags at startup:
>>> --appc_store_dir="/tmp/mesos/store/appc"
>>> --attributes="region:us-east-1;" --authenticatee=""
>>> --cgroups_cpu_enable_pids_and_tids_count="false"
>>> --cgroups_enable_cfs="false" --cgroups_hierarchy="/sys/fs/cgroup"
>>> --cgroups_limit_swap="false" --cgroups_root="mesos"
>>> --container_disk_watch_interval="15secs" --containerizers="mesos" "
>>> I0209 02:12:48.511706 97296 slave.cpp:354] Slave resources:
>>> ports(*):[7150-7200]; mem(*):240135; cpus(*):32; disk(*):586104
>>> I0209 02:12:48.512320 97296 slave.cpp:384] Slave hostname: 
>>> I0209 02:12:48.512368 97296 slave.cpp:389] Slave checkpoint: true
>>> I0209 02:12:48.516139 97299 group.cpp:331] Group process (group(1)@
>>> 10.138.146.230:7101) connected to ZooKeeper
>>> I0209 02:12:48.516216 97299 group.cpp:805] Syncing group operations:
>>> queue size (joins, cancels, datas) = (0, 0, 0)
>>> I0209 02:12:48.516253 97299 group.cpp:403] Trying to create path
>>> '/titus/main/mesos' in ZooKeeper
>>> I0209 02:12:48.520268 97275 detector.cpp:156] Detected a new leader:
>>> (id='209')
>>> I0209 02:12:48.520803 97284 group.cpp:674] Trying to get
>>> '/titus/main/mesos/json.info_000209' in ZooKeeper
>>> I0209 02:12:48.520874 97278 state.cpp:54] Recovering state from
>>> '/mnt/data/mesos/meta'
>>> I0209 02:12:48.520961 97278 state.cpp:690] Failed to find resources file
>>> '/mnt/data/mesos/meta/resources/resources.info'
>>> I0209 02:12:48.523680 97283 detector.cpp:481] A new leading master (UPID=
>>> master@10.230.95.110:7103) is detected
>>>
>>>
>>> The .FATAL log file when the original transient ZK error occurred:
>>>
>>> Log file created at: 2016/02/05 17:21:37
>>> Running on machine: titusagent-main-i-7697a9c5
>>> Log line format: [IWEF]mmdd hh:mm:ss.uu threadid file:line] msg
>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>
>>>
>>> The .ERROR log file:
>>>
>>> Log file created at: 2016/02/05 17:21:37
>>> Running on machine: titusagent-main-i-7697a9c5
>>> Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
>>> F0205 17:21:37.395644 53841 zookeeper.cpp:110] Failed to create
>>> ZooKeeper, zookeeper_init: No such file or directory [2]
>>>
>>> The .WARNING file had the same content.
>>>
>>
>> Maybe related: https://issues.apache.org/jira/browse/MESOS-1326
>>
>>
>> -rgs
>>
>>
>


Re: Scheduling tasks based on dependancy

2015-10-06 Thread Sharma Podila
Hi Pradeep,

We augment the mesos-slave command line on each agent to report the
available network bandwidth in Mbps. For example, the agents that have
1Gbps bandwidth have this additional custom resource (--resources command
line option) set: "network:1024". This shows up in the offers as a resource
with name "network" and value "1024", which can be used to assign. Similar
to how memory is assigned. That is, if you launch one task that requires
"network" resource of "100" value, that is, it is asking for 100 Mbps n/w
bandwidth, the next offer from Mesos will have network resource value of
924 (1024 - 100).
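To illustrate the accounting described above, here is a minimal, self-contained sketch (plain Python, not the Mesos API; the class and resource names are hypothetical). A custom resource such as "network" is tracked exactly like memory: launching a task deducts it, and the next offer reflects what is left.

```python
# Illustrative sketch (not Mesos code): a custom "network" resource is
# tracked on an agent the same way built-in resources are tracked.

class AgentResources:
    def __init__(self, **capacity):
        self.available = dict(capacity)

    def launch(self, **request):
        # Reject the task if any requested resource exceeds what is left.
        for name, amount in request.items():
            if self.available.get(name, 0) < amount:
                raise ValueError("insufficient %s" % name)
        for name, amount in request.items():
            self.available[name] -= amount

    def terminate(self, **request):
        # Return the resources when a task finishes.
        for name, amount in request.items():
            self.available[name] = self.available.get(name, 0) + amount


# Mirrors the example above: a 1 Gbps agent launches a task asking for
# 100 Mbps of "network" bandwidth.
agent = AgentResources(cpus=32, mem=240135, network=1024)
agent.launch(cpus=1, mem=1024, network=100)
print(agent.available["network"])  # 924
```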
Is this what you were asking for?

Sharma



On Tue, Oct 6, 2015 at 3:51 AM, Pradeep Kiruvale <pradeepkiruv...@gmail.com>
wrote:

> Hi Sharma,
>
> Is this how you collect the network info from the VMs?
>
> First you get the resource offers from the Mesos and then you collect the
> network bandwidth info and then you use that for assigning for your tasks?
> Or
> The mesos-slave collects the resource information? But I don't see any
> code to that and also the existing mesos-slave does not collects these
> resource information by itself.
>
> Am I missing something here?
>
> Regards,
> Pradeep
>
> On 5 October 2015 at 18:28, Sharma Podila <spod...@netflix.com> wrote:
>
>> Pradeep,
>>
>> We recently open sourced Fenzo <https://github.com/Netflix/Fenzo> (wiki
>> <https://github.com/Netflix/Fenzo/wiki>) to handle these scenarios. We
>> add a custom attribute for network bandwidth for each agent's "mesos-slave"
>> command line. And we have Fenzo assign resources to tasks based on CPU,
>> memory, disk, ports, and network bandwidth requirements. With Fenzo you can
>> define affinity, locality, and any other custom scheduling objectives using
>> plugins. Some of the plugins are already built in. It is also easy to add
>> additional plugins to cover other objectives you care about.
>>
>> "Dependencies" can mean multiple things. Do you mean dependencies on
>> certain attributes of resources/agents? Dependencies on where other tasks
>> are assigned? All of these are covered. However, if you mean workflow type
>> of dependencies on completion of other tasks, then, there are no built in
>> plugins. You could write one using Fenzo. It is also common for such
>> workflow dependencies to be covered by an entity external to the scheduler.
>> Both techniques can be made to work.
>>
>> Fenzo has the concept of hard vs. soft constraints. You could specify, for
>> example, resource affinity and/or task locality as a soft constraint or a
>> hard constraint. See the wiki docs link I provided above for details.
>>
>> Sharma
>>
>>
>> On Mon, Oct 5, 2015 at 8:21 AM, Pradeep Kiruvale <
>> pradeepkiruv...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Are there any frameworks that exist with Mesos to schedule
>>> bigger apps?
>>> I mean to say scheduling an app which has many services and will not fit
>>> into one physical node.
>>>
>>> Is there any framework that can be used to
>>>  schedule tasks based on the underlying hardware constraints like
>>> Network bandwidth ?
>>>  Schedule the tasks based on their dependencies and proximity to each
>>> other in a cluster or a rack?
>>>
>>> Thanks & Regards,
>>> Pradeep
>>>
>>
>>
>


Re: Scheduling tasks based on dependancy

2015-10-06 Thread Sharma Podila
Pradeep, attributes show up as name value pairs in the offers. Custom
attributes can also be used in Fenzo for assignment optimizations. For
example, we set custom attributes for AWS EC2 ZONE names and ASG names. We
use the ZONE name custom attribute to balance tasks of a job across zones
via the built-in constraint plugin, BalancedHostAttributeConstraint.
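A minimal sketch of that balancing idea (this is not Fenzo's API; the function and attribute names here are hypothetical, and only the least-loaded-zone scoring is the point): each new task of a job is placed on the host whose attribute value (e.g. zone) currently holds the fewest of that job's tasks.

```python
# Illustrative sketch of balancing a job's tasks across a host attribute
# (e.g. EC2 zone), in the spirit of Fenzo's BalancedHostAttributeConstraint.
from collections import Counter

def pick_host(hosts, zone_of, placed_zones):
    """Prefer the host whose zone has the fewest tasks of this job so far."""
    counts = Counter(placed_zones)
    return min(hosts, key=lambda h: counts[zone_of[h]])

zone_of = {"a1": "us-east-1a", "b1": "us-east-1b", "b2": "us-east-1b"}
placed = []
for _ in range(4):
    host = pick_host(list(zone_of), zone_of, placed)
    placed.append(zone_of[host])

print(Counter(placed))  # two tasks per zone
```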



On Tue, Oct 6, 2015 at 4:03 AM, Pradeep Kiruvale 
wrote:

> Hi Guangya,
>
> One doubt about the  --attributes=rackid:r1;groupid:g1 option.
>
> How does the master provisions the resources? How will be the resource
> offer?
>
> Is it like (Rack 1 , G1, System)? how does this way of  doing resource
> offer will help?
>
> Can you please give me more information?
>
>
> -Pradeep
>
>
>
> On 5 October 2015 at 17:45, Guangya Liu  wrote:
>
>> Hi Pradeep,
>>
>> I think that you can try Chronos and Marathon which can help you.
>>
>> *Marathon:* https://github.com/mesosphere/marathon
>> You can try Marathon + Mesos + Mesos Resource Attribute
>>
>> When you start up mesos slave, uses --attributes option, here is an
>> example:
>> ./bin/mesos-slave.sh --master=9.21.61.21:5050 --quiet
>> --log_dir=/tmp/mesos --attributes=rackid:r1;groupid:g1
>> This basically defines two attributes for this mesos slave host. rackid
>> with value r1 and groupid with value g1.
>>
>> marathon start -i "like_test" -C "sleep 100" -n 4 -c 1 -m 50 -o
>> "rackid:LIKE:r1"
>>
>> this will place applications on the slave node whose rackid is r1
>>
>> *Chronos:* https://github.com/mesos/chronos , Chronos supports the
>> definition of jobs triggered by the completion of other jobs. It supports
>> arbitrarily long dependency chains.
>>
>> Thanks,
>>
>> Guangya
>>
>> On Mon, Oct 5, 2015 at 11:21 PM, Pradeep Kiruvale <
>> pradeepkiruv...@gmail.com> wrote:
>>
>>> Hi All,
>>>
>>> Are there any frameworks that exist with Mesos to schedule
>>> bigger apps?
>>> I mean to say scheduling an app which has many services and will not fit
>>> into one physical node.
>>>
>>> Is there any frame work that can be used to
>>>  schedule tasks based on the underlying hardware constraints like
>>> Network bandwidth ?
>>>
>>  Schedule the tasks based on their dependencies and proximity to each
>>> other in a cluster or a rack?
>>>
>>> Thanks & Regards,
>>> Pradeep
>>>
>>
>>
>


Re: Scheduling tasks based on dependancy

2015-10-05 Thread Sharma Podila
Pradeep,

We recently open sourced Fenzo  (wiki
) to handle these scenarios. We add
a custom attribute for network bandwidth for each agent's "mesos-slave"
command line. And we have Fenzo assign resources to tasks based on CPU,
memory, disk, ports, and network bandwidth requirements. With Fenzo you can
define affinity, locality, and any other custom scheduling objectives using
plugins. Some of the plugins are already built in. It is also easy to add
additional plugins to cover other objectives you care about.

"Dependencies" can mean multiple things. Do you mean dependencies on
certain attributes of resources/agents? Dependencies on where other tasks
are assigned? All of these are covered. However, if you mean workflow type
of dependencies on completion of other tasks, then, there are no built in
plugins. You could write one using Fenzo. It is also common for such
workflow dependencies to be covered by an entity external to the scheduler.
Both techniques can be made to work.

Fenzo has the concept of hard vs. soft constraints. You could specify, for
example, resource affinity and/or task locality as a soft constraint or a
hard constraint. See the wiki docs link I provided above for details.
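A minimal sketch of the hard vs. soft distinction (hypothetical structures, not Fenzo's API): a hard constraint filters hosts out entirely, while a soft constraint only scores them, so the best-scoring feasible host wins.

```python
# Illustrative sketch: hard constraints are predicates that eliminate
# hosts; soft constraints are scores that rank the survivors.

def place(task, hosts, hard, soft):
    feasible = [h for h in hosts if all(c(task, h) for c in hard)]
    if not feasible:
        return None  # task stays pending
    # Higher total soft-constraint score is better.
    return max(feasible, key=lambda h: sum(s(task, h) for s in soft))

hosts = [{"name": "h1", "mem": 512}, {"name": "h2", "mem": 4096}]
hard = [lambda t, h: h["mem"] >= t["mem"]]  # must fit in memory
soft = [lambda t, h: -h["mem"]]             # prefer smaller hosts

print(place({"mem": 1024}, hosts, hard, soft)["name"])  # h2
```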

Sharma


On Mon, Oct 5, 2015 at 8:21 AM, Pradeep Kiruvale 
wrote:

> Hi All,
>
> Are there any frameworks that exist with Mesos to schedule bigger
> apps?
> I mean to say scheduling an app which has many services and will not fit
> into one physical node.
>
> Is there any framework that can be used to
>  schedule tasks based on the underlying hardware constraints like Network
> bandwidth ?
>  Schedule the tasks based on their dependencies and proximity to each
> other in a cluster or a rack?
>
> Thanks & Regards,
> Pradeep
>


Re: Metric for tasks queued/waiting?

2015-09-23 Thread Sharma Podila
Discussing in a separate place/JIRA ticket sounds good.
Basically, representing contention using a summary of pending resource
requests from each framework could provide hints to the Mesos master. However,
this gets into intricacies, not the least of which is the diversity of resource
requests, qualified by queue depth.
Another way to think of this could be that each framework could trigger a
scale up individually (say, by hitting a mesos master or another
independent service's endpoint to add additional agents/slaves). Even
uncoordinated scale up actions from multiple frameworks should result in
the same end result, modulo reservations/limits/etc. Then, mesos master
needs to deal with only scale down, which it could perform based on offer
rejections from frameworks, implying nobody needs that many agents/slaves.
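As a rough sketch of that scale-down idea (hypothetical code, not part of Mesos): an agent whose offers every framework has declined for some number of consecutive rounds becomes a candidate for termination.

```python
# Illustrative sketch: choosing scale-down candidates from offer
# rejections. `decline_history` maps agent -> consecutive rounds in
# which all frameworks declined that agent's offers.

def scale_down_candidates(decline_history, idle_rounds=3):
    return sorted(
        agent for agent, declines in decline_history.items()
        if declines >= idle_rounds
    )

history = {"agent-1": 5, "agent-2": 1, "agent-3": 3}
print(scale_down_candidates(history))  # ['agent-1', 'agent-3']
```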

Maybe that's more details than needed in this discussion...



On Wed, Sep 23, 2015 at 2:05 PM, Niklas Nielsen <nik...@mesosphere.io>
wrote:

> I'd love to see this solved in a general way; "How does the framework
> communicate (insert intent, metric, hint, etc) to mesos".
>
> In one way, the 'webui_url' of in the framework info conveys "This is how
> you get to my web ui". As providing a webui was a common pattern for the
> frameworks.
>
> This could be expanded, so the framework can report an 'apiui_url' or
> maybe even more specific "metrics_url" where the mesos master (or other
> frameworks and 3rd party tooling) can get insights into queue depths,
> resource preferences, etc.
>
> We can start discussing this further in a JIRA ticket :)
>
> Niklas
>
> On 23 September 2015 at 13:54, Alex Gaudio <adgau...@gmail.com> wrote:
>
>> Hi Aaron,
>>
>> You might consider trying to solve the autoscaling problem with Relay, a
>> Python tool I use to solve this problem.  Feel free to shoot me an email if
>> you are interested.
>>
>> github.com/sailthru/relay
>>
>> Alex
>>
>> On Wed, Sep 23, 2015, 11:03 AM David Greenberg <dsg123456...@gmail.com>
>> wrote:
>>
>>> In addition, this technique could be implemented in the allocator with
>>> an understanding of global demand:
>>> https://www.youtube.com/watch?v=BkBMYUe76oI
>>>
>>> That would allow for tunable fair-sharing based on DRF-principles.
>>>
>>> On Wed, Sep 23, 2015 at 10:59 AM haosdent <haosd...@gmail.com> wrote:
>>>
>>>> Feel free to open a story in jira if you think you ideas are awesome.
>>>> :-)
>>>>
>>> On Sep 23, 2015 10:54 PM, "Sharma Podila" <spod...@netflix.com> wrote:
>>>>
>>>>> Ah, OK, thanks. Yes, Fenzo is a Java library.
>>>>>
>>>>> It might be a nice addition to Mesos master to get a global view of
>>>>> contention for resources. In addition to autoscaling, it would be useful 
>>>>> in
>>>>> the allocator.
>>>>>
>>>>>
>>>>>
>>>>> On Wed, Sep 23, 2015 at 7:29 AM, Aaron Carey <aca...@ilm.com> wrote:
>>>>>
>>>>>> Thanks Sharma,
>>>>>>
>>>>>> I was in the audience for a talk you did about Fenzo at MesosCon :)
>>>>>> It looked great but we're a python shop primarily so the Java requirement
>>>>>> would be a problem for us.
>>>>>>
>>>>>> The scaling in the scheduler makes total sense, (obvious when you
>>>>>> think about it!), I was naively hoping for some sort of knowledge of that
>>>>>> back in the Mesos master as we were hoping to have scaling be independent
>>>>>> of schedulers. I think this'll need a re-think!
>>>>>>
>>>>>> Thanks for your help!
>>>>>>
>>>>>> Aaron
>>>>>>
>>>>>> --
>>>>>> *From:* Sharma Podila [spod...@netflix.com]
>>>>>> *Sent:* 23 September 2015 15:22
>>>>>>
>>>>>> *To:* user@mesos.apache.org
>>>>>> *Subject:* Re: Metric for tasks queued/waiting?
>>>>>>
>>>>>> Jobs/tasks wait in framework schedulers, not mesos master.
>>>>>> Autoscaling triggers must come from schedulers, not only because that's 
>>>>>> who
>>>>>> knows the pending task set size, but, also because it knows how many of
>>>>>> them need to be launched right away, on what kind of machines.
>>>>>>
>>>>>> We built such 

Re: How to kill a task gracefully?

2015-09-22 Thread Sharma Podila
I believe this depends on the executor being used. A kill request to the
Mesos driver from the framework scheduler is delivered to the executor. The kill
request by itself is not a guarantee that the task will be killed, until
honored by the executor. So, it is possible that the executor can be
written to achieve the "graceful" behavior you desire. The resources
assigned to the task aren't reclaimed until the task is reported
killed/terminated by the executor. It is possible, for example, to perform
"clean up" or "checkpointing" as part of a task kill request. But, that can
be tricky in terms of delaying the actual kill logistics, and therefore
making the resources available.
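The pattern described above can be sketched roughly as follows (hypothetical names, not the Mesos executor API): the kill request alone does not terminate the task; the executor checkpoints and cleans up first, and only then reports the terminal status, at which point the resources are reclaimed.

```python
# Illustrative sketch of a "graceful kill" inside a custom executor.

class GracefulTask:
    def __init__(self):
        self.state = "TASK_RUNNING"
        self.checkpointed = False

    def checkpoint(self):
        # Stand-in for whatever clean-up/checkpoint work the task needs.
        self.checkpointed = True

    def kill(self, on_update):
        # The executor decides when the task actually dies: it
        # checkpoints first, then reports the terminal state. Resources
        # are reclaimed only after on_update fires.
        self.checkpoint()
        self.state = "TASK_KILLED"
        on_update(self.state)

updates = []
task = GracefulTask()
task.kill(updates.append)
print(task.checkpointed, updates)  # True ['TASK_KILLED']
```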

Is your question regarding a specific executor you are already using?

In case you are referring to resource oversubscription feature recently
introduced, those kills are determined by an entity local to the Mesos
agent node. "Grace" would be majorly influenced by that entity and less by
the executor.


On Tue, Sep 22, 2015 at 9:39 AM, Jerry Lam  wrote:

> Hi mesos users,
>
> anyone knows how to kill a task running in Mesos gracefully? Thanks!
>
>
> Best Regards,
>
> Jerry
>


Re: Setting maximum per-node resources in offers

2015-09-10 Thread Sharma Podila
FYI-
If you are to use Fenzo in writing your framework, it has support for
limiting overall resources used by tasks with the use of a "group name".
That is, all tasks with a group name, say "userA", would be limited to
using the resources specified in the limit for the group. For this to work,
you would have to specify the limits for each user and specify each task's
group name as the user's name, same as in the limits. Each user can be
given different limits, if desired. See this
 for
details.

In general, "fair share" is subjective. Quotas fragment the cluster and can
reduce the overall cluster utilization when only a few users are active. One
improvement may be to treat the limits as soft limits. That is, let users
use resources beyond their limits if there is no contention. However, for
this to work well, we would need one of two things to be true:

1. rate of task completion is high enough that a new user will be able to
get resources after not using the cluster for a while, or,
2. users' tasks that are consuming more resources than limits can be
preempted when needed for other users.

The quota management in Mesos, that Guangya gave the link for, seems to
address some of these concerns. My understanding is that the MVP is going
to be the equivalent of hard limits.
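A minimal sketch of the hard-limit vs. soft-limit distinction discussed above (hypothetical code; contention handling and preemption are deliberately left out): under a hard limit a group can never exceed its quota, while under a soft limit it may exceed it as long as the cluster has spare capacity.

```python
# Illustrative sketch: admission check for a group's resource request
# against its limit, in hard-limit or soft-limit mode.

def admit(usage, limit, request, spare, soft=False):
    if usage + request <= limit:
        return True
    # Over the limit: only a soft limit with spare capacity allows it.
    return soft and spare >= request

assert admit(usage=90, limit=100, request=5, spare=0) is True
assert admit(usage=98, limit=100, request=5, spare=50) is False           # hard limit
assert admit(usage=98, limit=100, request=5, spare=50, soft=True) is True  # soft limit
print("ok")
```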



On Tue, Sep 8, 2015 at 11:55 PM, Guangya Liu  wrote:

> Great that it helps!
>
> I think that it is a bit heavy to running Spark+Aurora+Mesos, but you can
> have a try if it can fill your requirement. ;-)
>
> In my understanding, I think that what you may want to have a try with
> Spark + (Customized Spark Scheduler, leverage Fenzo or others) + Mesos, but
> this may involve some code change for spark.
>
> Thanks,
>
> Guangya
>
> On Wed, Sep 9, 2015 at 2:05 PM, RJ Nowling  wrote:
>
>> Thanks, Guangya!
>>
>> Inspired by your comments, I've also been thinking about the option of
>> using Apache Aurora to provide some of the features I want.  Spark could be
>> deployed in standalone mode on top of Aurora on top of Mesos. :)
>>
>> Funny enough, two of my colleagues (Tim St. Clair and Erik Erlandson)
>> seem to be tracking and commenting on the epic you linked to.  :)
>>
>> On Wed, Sep 9, 2015 at 12:59 AM, Guangya Liu  wrote:
>>
>>> Hi RJ, please check my answers in line.
>>>
>>> Thanks,
>>>
>>> Guangya
>>>
>>> On Wed, Sep 9, 2015 at 1:24 PM, RJ Nowling  wrote:
>>>
 Hi Guangya,

 My use case is actually trying to run Spark (in coarse grain mode) with
 multiple users. I wanted ways to better ensure fair scheduling across
 users. Spark provides very few primitives so I was hoping I could use Mesos
 to limit resources per user and control how the cluster is partitioned. For
 example, I may prefer that a Spark jobs share multiple machines without
 using all resources on a single machine for fault tolerance.

>>> For this scenario, you may want to schedule those offered resource again
>>> in framework level, you can leverage fenzo or what ever to enhance the
>>> scheduler part for spark to achieve your goal.
>>>

 I'm also considering the case of running multiple frameworks. In this
 case, frameworks would have to coordinate to enforce user quotas and such.
 It seems that this would be better solved somewhere below the framework
 level.

>>> For this scenario, there is an epic for "quota management" which can
>>> fill your requirement but it is still undergoing and not available now.
>>> epic: https://issues.apache.org/jira/browse/MESOS-1791
>>> Design doc:
>>> https://docs.google.com/document/d/16iRNmziasEjVOblYp5bbkeBZ7pnjNlaIzPQqMTHQ-9I/edit?pli=1#heading=h.9g7fqjh6652v
>>>

 RJ



 On Sep 8, 2015, at 11:47 PM, Guangya Liu  wrote:

 Hi RJ,

 I think that your final goal is that you want to use framework running
 on top of mesos to execute some tasks. Such logic should be in the
 framework part. The netflix open sourced a framework scheduler library
 named as fenzo, you may want to take a look at this one to see if it can
 help you.


 http://techblog.netflix.com/2015/08/fenzo-oss-scheduler-for-apache-mesos.html
 https://github.com/Netflix/Fenzo

 Thanks,

 Guangya

 --
 Date: Tue, 8 Sep 2015 23:09:36 -0500
 Subject: Re: Setting maximum per-node resources in offers
 From: rnowl...@gmail.com
 To: user@mesos.apache.org

 Thanks, Klaus.

 I think I was probably misunderstanding the role of the allocator in
 Mesos versus the scheduler in the framework sitting on top of Mesos.
 Probably out of scope for Mesos to divide up resources as I was suggesting.

 On Tue, Sep 8, 2015 at 10:48 PM, Klaus Ma  wrote:

 If it's the only framework, you 

Re: MesosCon Seattle attendee introduction thread

2015-08-17 Thread Sharma Podila
Hello Everyone,

I am Sharma Podila, senior software engineer at Netflix. It is exciting to
be a part of MesosCon again this year.
We developed a cloud-native Mesos framework to run a mix of service, batch,
and stream processing workloads. To that end, we created a reusable
plug-in-based scheduling library, Fenzo. I am looking forward to
presenting an in-depth look on Thursday at 2pm about how we achieve
scheduling objectives and cluster autoscaling, as well as sharing some of
our results with you.

I am interested in learning about and collaborating with you all regarding
scheduling and framework development.

Sharma



On Mon, Aug 17, 2015 at 2:11 AM, Ankur Chauhan an...@malloc64.com wrote:

 Hi all,

 I am Ankur Chauhan. I am a Sr. Software engineer with the Reporting and
 Analytics team
 at Brightcove Inc. I have been evaluating, tinkering, developing with
 mesos for about an year
 now. My latest adventure has been in the spark mesos integration and
 writing the new apache flink -
 mesos integration.

 I am interested in learning about managing stateful services in mesos and
 creating better documentation
 for the project.

 I am very excited to meet everyone!

 -- Ankur Chauhan.

  On 17 Aug 2015, at 00:10, Trevor Powell trevor.pow...@rms.com wrote:
 
  Hey Mesos Family! Can't wait to see you all in person.
 
  I'm Trevor Powell. I am the Product Owner for our TechOps engineering
 team
  at RMS. RMS is in the catastrophic modeling business. Think of it as
  modeling Acts of God (earthquakes, floods, Godzilla, etc)  on physical
  property and damages associated with them.
 
  We've been evaluating Mesos this year, and we are planning to launch it
 in
  PRD at the start of next. I am super excited :-)
 
  I am very interested in managing stateful applications inside Mesos. Also
  network segmentation in Mesos (see my "Mesos, Multinode Workload Network
  segregation" email thread earlier this month).
 
  See you all Thursday!!
 
  Stay Smooth,
 
  --
 
  Trevor Alexander Powell
  Sr. Manager, Cloud Engineer  Architecture
  7575 Gateway Blvd. Newark, CA 94560
  T: +1.510.713.3751
  M: +1.650.325.7467
  www.rms.com
  https://www.linkedin.com/in/trevorapowell
 
  https://github.com/tpowell-rms
 
 
 
 
 
 
  On 8/16/15, 1:58 PM, Dave Lester d...@davelester.org wrote:
 
  Hi All,
 
  I'd like to kick off a thread for folks to introduce themselves in
  advance of #MesosCon
  http://events.linuxfoundation.org/events/mesoscon. Here goes:
 
  My name is Dave Lester, and I'm an Open Source Advocate at Twitter. I am
  a member of the MesosCon program committee, along with a stellar group
  of other community members who have volunteered
  
 http://events.linuxfoundation.org/events/mesoscon/program/programcommitte
  e.
  Can't wait to meet as many of you as possible.
 
  I'm eager to meet with folks interested in learning more about how we
  deploy and manage services at Twitter using Mesos and Apache Aurora
  http://aurora.apache.org. Twitter has a booth where I'll be hanging
  out for a portion of the conference, feel free to stop by and say hi.
  I'm also interested in connecting with companies that use Mesos; let's
  make sure we add you to our #PoweredByMesos list
  http://mesos.apache.org/documentation/latest/powered-by-mesos/.
 
  I'm also on Twitter: @davelester
 
  Next!
 




Re: Setting minimum offer size

2015-06-30 Thread Sharma Podila
Having the knowledge of tasks pending in the frameworks, at least via the
offer filters specifying minimum resource sizes, could prove useful. And
roles+weights would be complementary. This might remove the need to use
dynamic reservations for every framework that uses more than the smallest
size resources. Starvation often ends up being addressed via multiple
tricks including reservations, priority/weights based preemptions, and
oversubscription of resources, to name a few.

This may then tend to make frameworks relatively more homogeneous in
their task sizes, unless they further implement prioritization within their
tasks and mostly ask for offer sizes that fit their bigger tasks.
Effectively, they become homogeneous in terms of the offer sizes they
filter on.

In general, the more diverse the resource requests, the more difficult the
scheduling problem.
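A rough sketch of such a minimum-size filter from the framework side (hypothetical structures; only the refuse-timeout idea mirrors Mesos's Filters.refuse_seconds): the framework declines any offer smaller than its smallest pending task, attaching a timeout so the master withholds that agent's resources from it for a while.

```python
# Illustrative sketch: a framework-side minimum-offer-size filter.

def evaluate_offer(offer_mem, pending_task_mems, refuse_seconds=60.0):
    # Decline if nothing is pending, or if the offer cannot fit even
    # the smallest pending task; the timeout throttles re-offers.
    if not pending_task_mems or offer_mem < min(pending_task_mems):
        return ("decline", refuse_seconds)
    return ("accept", 0.0)

print(evaluate_offer(512, [1024, 20480]))   # ('decline', 60.0)
print(evaluate_offer(2048, [1024, 20480]))  # ('accept', 0.0)
```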


On Tue, Jun 30, 2015 at 7:25 AM, Dharmesh Kakadia dhkaka...@gmail.com
wrote:

 Yes, alternative allocator module will be great in terms of
 implementation, but adding more capabilities to filters might be required
 to convey some more info to the Mesos scheduler/allocator. Am I correct
 here or are there already ways to convey such info ?

 Thanks,
 Dharmesh

 On Tue, Jun 30, 2015 at 7:15 PM, Alex Rukletsov a...@mesosphere.com
 wrote:

 One option is to implement alternative behaviour in an allocator module.

 On Tue, Jun 30, 2015 at 3:34 PM, Dharmesh Kakadia dhkaka...@gmail.com
 wrote:

 Interesting.

 I agree, that dynamic reservation and optimistic offers will help
 mitigate the issue, but the resource fragmentation (and starvation due to
 that) is a more general problem. Predictive models can certainly aid the
 Mesos scheduler here. I think the filters in Mesos can be extended to add
 more general preferences like the offer size, execution/predictive model
 etc. For the Mesos scheduler, the user should be able to configure what all
 filters it recognizes while making offers, which will also make the effect
 on scalability limited,as far as I understand. Thoughts?

 Thanks,
 Dharmesh



 On Sun, Jun 28, 2015 at 7:29 PM, Alex Rukletsov a...@mesosphere.com
 wrote:

 Sharma,

 that's exactly what we plan to add to Mesos. Dynamic reservations will
 land in 0.23, the next step is to optimistically offer reserved but yet
 unused resources (we call them optimistic offers) to other framework as
 revocable. The alternative with one framework will of course work, but this
 implies having a general-purpose framework, that does some work that is
 better done by Mesos (which has more information and therefore can take
 better decisions).

 On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila spod...@netflix.com
 wrote:

 In a previous (more HPC like) system I worked on, the scheduler did
 advance reservation of resources, claiming bits and pieces it got and
 holding on until all were available. Say the last bit is expected to come
 in about 1 hour from now (and this needs job runtime 
 estimation/knowledge),
 any short jobs are back filled on to the advance reserved resources that
 are sitting idle for an hour, to improve utilization. This was combined
 with weights and priority-based job preemptions; sometimes 1GB jobs are
 higher priority than the 20GB job. Unfortunately, that technique doesn't
 lend itself natively onto Mesos based scheduling.

 One idea that may work in Mesos is (thinking aloud):

 - The large (20GB) framework reserves 20 GB on some number of slaves
 (I am referring to dynamic reservations here, which aren't available yet)
 - The small framework continues to use up 1GB offers.
 - When the large framework needs to run a job, it will have the 20 GB
 offers since it has the reservation.
 - When the large framework does not have any jobs running on it, the
 small framework may be given those resources, but, those jobs will have to
 be preempted in order to offer 20 GB to the large framework.

 I understand this idea has some forward looking expectations on how
 dynamic reservations would/could work. Caveat: I haven't involved myself
 closely with that feature definition, so could be wrong with my
 expectations.

 Until something like that lands, the existing static reservations, of
 course, should work. But, that reduces utilization drastically if the 
 large
 framework runs jobs sporadically.

 Another idea is to have one framework schedule both the 20GB jobs and
 1GB jobs. Within the framework, it can bin pack the 1GB jobs on to as 
 small
 a number of slaves as possible. This increases the likelihood of finding
 20GB on a slave. Combining that with preemptions from within the framework
 (a simple kill of certain number of 1GB jobs) should satisfy the 20 GB 
 jobs.



 On Wed, Jun 24, 2015 at 9:26 AM, Tim St Clair tstcl...@redhat.com
 wrote:



 - Original Message -
  From: Brian Candler b.cand...@pobox.com
  To: user@mesos.apache.org
  Sent: Wednesday, June 24, 2015 10:50:43 AM
  Subject: Re: Setting minimum offer size
 
  On 24/06/2015 16:31

Re: Cluster autoscaling in Spark+Mesos ?

2015-06-05 Thread Sharma Podila
Not yet, we are working on making it available, sometime soon (I know, I've
said that before). Until then, if you are interested, some details are
available in my slides from Nov at
http://www.slideshare.net/spodila/aws-reinvent-2014-talk-scheduling-using-apache-mesos-in-the-cloud


On Fri, Jun 5, 2015 at 12:05 AM, Ankur Chauhan an...@malloc64.com wrote:


 Hi,

 @Sharma - Is mantis/fenzo available on github or something, I did find
 some maven artifacts but the repository netflix/fenzo is a 404. I am
 interested in learning about the bin packing logic of fenzo.

 -- Ankur Chauhan

 On 04/06/2015 22:35, Sharma Podila wrote:
  We Autoscale our Mesos cluster in EC2 from within our framework.
  Scaling up can be easy via watching demand vs. supply. However,
  scaling down requires bin packing the tasks tightly onto as few
  servers as possible. Do you have any specific ideas on how you
  would leverage Mantis/Mesos for Spark based jobs? Fenzo, the
  scheduler part of Mantis, could be another point of leverage, which
  could give a framework the ability to autoscale the cluster among
  other benefits.
 
 
 
  On Thu, Jun 4, 2015 at 1:06 PM, Dmitry Goldenberg
  dgoldenberg...@gmail.com mailto:dgoldenberg...@gmail.com
  wrote:
 
  Thanks, Vinod. I'm really interested in how we could leverage
  something like Mantis and Mesos to achieve autoscaling in a
  Spark-based data processing system...
 
  On Jun 4, 2015, at 3:54 PM, Vinod Kone vinodk...@gmail.com
  mailto:vinodk...@gmail.com wrote:
 
  Hey Dmitry. At the current time there is no built-in support for
  Mesos to autoscale nodes in the cluster. I've heard people
  (Netflix?) do it out of band on EC2.
 
  On Thu, Jun 4, 2015 at 9:08 AM, Dmitry Goldenberg
  dgoldenberg...@gmail.com mailto:dgoldenberg...@gmail.com
  wrote:
 
  A Mesos noob here. Could someone point me at the doc or summary
  for the cluster autoscaling capabilities in Mesos?
 
  Is there a way to feed it events and have it detect the need to
  bring in more machines or decommission machines?  Is there a way
  to receive events back that notify you that machines have been
  allocated or decommissioned?
 
  Would this work within a certain set of
  preallocated/pre-provisioned/stand-by machines or will Mesos
  go and grab machines from the cloud?
 
  What are the integration points of Apache Spark and Mesos? What
  are the true advantages of running Spark on Mesos?
 
  Can Mesos autoscale the cluster based on some signals/events
  coming out of Spark runtime or Spark consumers, then cause the
  consumers to run on the updated cluster, or signal to the
  consumers to restart themselves into an updated cluster?
 
  Thanks.
 
 
 



Re: Cluster autoscaling in Spark+Mesos ?

2015-06-04 Thread Sharma Podila
We autoscale our Mesos cluster in EC2 from within our framework. Scaling up
can be easy via watching demand vs. supply. However, scaling down requires
bin packing the tasks tightly onto as few servers as possible.
Do you have any specific ideas on how you would leverage Mantis/Mesos for
Spark based jobs? Fenzo, the scheduler part of Mantis, could be another
point of leverage, which could give a framework the ability to autoscale
the cluster among other benefits.
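The scale-down point can be sketched with a simple first-fit-decreasing packing (a deliberately simplified model, not Fenzo's implementation): packing tasks onto as few agents as possible leaves whole agents idle, and those can then be terminated.

```python
# Illustrative sketch: first-fit-decreasing bin packing of task memory
# requirements onto agents of fixed capacity; returns how few agents
# the tasks need, so the rest can be scaled down.

def first_fit_decreasing(task_mems, agent_capacity):
    agents = []  # each entry is the remaining capacity on one agent
    for mem in sorted(task_mems, reverse=True):
        for i, free in enumerate(agents):
            if free >= mem:
                agents[i] -= mem
                break
        else:
            agents.append(agent_capacity - mem)
    return len(agents)

# Nine 3GB tasks pack onto three 10GB agents instead of nine.
print(first_fit_decreasing([3] * 9, 10))  # 3
```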



On Thu, Jun 4, 2015 at 1:06 PM, Dmitry Goldenberg dgoldenberg...@gmail.com
wrote:

 Thanks, Vinod. I'm really interested in how we could leverage something
 like Mantis and Mesos to achieve autoscaling in a Spark-based data
 processing system...

 On Jun 4, 2015, at 3:54 PM, Vinod Kone vinodk...@gmail.com wrote:

 Hey Dmitry. At the current time there is no built-in support for Mesos to
 autoscale nodes in the cluster. I've heard people (Netflix?) do it out of
 band on EC2.

 On Thu, Jun 4, 2015 at 9:08 AM, Dmitry Goldenberg 
 dgoldenberg...@gmail.com wrote:

 A Mesos noob here. Could someone point me at the doc or summary for the
 cluster autoscaling capabilities in Mesos?

 Is there a way to feed it events and have it detect the need to bring in
 more machines or decommission machines?  Is there a way to receive events
 back that notify you that machines have been allocated or decommissioned?

 Would this work within a certain set of
 preallocated/pre-provisioned/stand-by machines or will Mesos go and
 grab machines from the cloud?

 What are the integration points of Apache Spark and Mesos?  What are the
 true advantages of running Spark on Mesos?

 Can Mesos autoscale the cluster based on some signals/events coming out
 of Spark runtime or Spark consumers, then cause the consumers to run on the
 updated cluster, or signal to the consumers to restart themselves into an
 updated cluster?

 Thanks.





Re: [DISCUSS] Renaming Mesos Slave

2015-06-02 Thread Sharma Podila
My $0.02...
The use of the word Worker is confusing. This entity has several
responsibilities, including maintaining connectivity to the master, managing
and monitoring the executors, sending status updates, and other future
endeavors such as autonomously determining actions for resource
oversubscriptions, etc. That is, the entity (so far called the slave) has
some intelligence and autonomous behavior associated with it.

The word Worker, in my mind, gives it the attribute of performing a
single-purpose action of executing something for the Master. Whereas the
word Agent attributes a bit more intelligence to it, one aspect of which is
to execute executors/tasks/containers. Worker has more similarities to
executor than to a Mesos slave.

So, here's my suggestion:

1. Mesos Agent (node)
2. Mesos Agent (daemon/process)
3. No
4. Via deprecation, documentation, etc.



On Tue, Jun 2, 2015 at 1:26 PM, Connor Doyle con...@mesosphere.io wrote:

 James, I'll just say one thing:

 The proposed change is for the benefit of those who _do_ have a problem
 with the current name.

 Of course you are free from having to empathize, but why block the change
 if there is support?
 Finding out if there is wider support is the purpose of this thread.

 --
 Connor


  On Jun 2, 2015, at 11:43, CCAAT cc...@tampabay.rr.com wrote:
 
  On 06/02/2015 11:58 AM, craig mcmillan wrote:
  not being from a slavery oppressed minority i'm not in a position to
  offer an opinion on the experience of the use of 'slave' in CS
  terminology, and the definition of 'minion' doesn't seem overly more
  empowering
 
  however :
 
  dom / sub
 
  is more fun and a little bit cheeky
 
  :c
 
 
  Ah. The nightclub scene is more salient; so Berlin is your favourite
 city?  What if the roles reverse; how does that map to mesos, clustering or
 parallel efforts? For humorous reasons, I like
  Mommy -- daddy so as to promote females to participate in mesos?
 
 
 
  I say all of this, as my grandfather, who later on in life became
  a pharmacist and drug store owner, was a slave in his youth. I do not find
 it offensive. The only thing I find offensive is those not willing to fight
 to overcome their circumstances. As an over-educated person, I find the
 entire historical education experience much more offensive than something
 that has existed in every culture that is more than a few hundred years
 old. For me, obtaining education and then social status from elites is an
 ugly process. Now, here in the USA, we
  have graduates in debt up to their eyes and often no jobs. If you want to
 address a social ill, why not just get rid of tenure and put the
 pedants on the same hire-fire master-slave relationship graduates are
 under?  The past is just that: the past; learn from it and move on. Take
  action about TODAY and tomorrow. Stop wallowing in the self-pity of
 what others did hundreds or thousands of years ago!
 
 
  WE still have wage-slaves,  sex-slaves and many forms of human traffic
 that are or are very, very close to slavery. Try to show your independence,
 as part of a military collective; commander-slave.
 
  How about elite-slave?  politician-slave?  Ivy_league--community_college
  for names?
 
 
  As a solution, why don't we make these relationships 'user defined
 variables'?  Surely that would be great fun and prepare us for supporting
 languages such as Haskell in a fun and ambitious function
  sort of way? [1]
 
 
 
  James
 
  [1]
 http://lesswrong.com/lw/k1o/botworld_a_cellular_automaton_for_studying/
 




Re: Is launchTasks() with multiple offers limited to a single slave?

2015-03-19 Thread Sharma Podila
I will assume that you are not talking of the case where a task actually is
being launched on multiple slaves, since a task can only be launched on one
slave with existing concepts.

Yes, that call is for one or more tasks on a single slave. That call (since
0.18, I believe) also takes multiple offers of the same slave, which can
happen due to tasks finishing at different times on the host.

I have seen discussion on batching status updates/acks. But, not on
batching launching of tasks across multiple slaves. From a user
perspective, I'd imagine that this should be possible. It would be useful
for frameworks with high rate of task dispatching.

I suspect (purely my opinion) that this model may have come up in the
beginning when most frameworks were scheduling one task at a time before
moving to the next pending task. My framework, for example, runs a
scheduling loop/iteration and comes up with schedules for multiple tasks
across one or more slaves. I would find it useful as well to batch up task
launches across multiple hosts.

That said, I haven't found the existing method to be limiting in
performance/latency for our needs at this time.
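One way to work within the single-slave restriction while still batching in a scheduling loop is to group assignments by slave and issue one launchTasks() call per slave. Below is a minimal sketch of that grouping; the plain-dict offers and the trailing driver loop are illustrative stand-ins, not the actual Mesos protobuf/driver API.

```python
from collections import defaultdict

def group_launches_by_slave(assignments):
    """Group (offer, task) assignments so that each launchTasks() call
    only uses offers from a single slave. `assignments` is an iterable of
    (offer, task) pairs, where each offer carries a slave_id.
    Returns {slave_id: (offer_ids, tasks)}."""
    per_slave = defaultdict(lambda: ([], []))
    for offer, task in assignments:
        offer_ids, tasks = per_slave[offer["slave_id"]]
        if offer["id"] not in offer_ids:   # multiple tasks may share an offer
            offer_ids.append(offer["id"])
        tasks.append(task)
    return dict(per_slave)

# The driver loop would then issue one call per slave, e.g.:
# for slave_id, (offer_ids, tasks) in group_launches_by_slave(a).items():
#     driver.launchTasks(offer_ids, tasks)
```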



On Thu, Mar 19, 2015 at 8:19 AM, Itamar Ostricher ita...@yowza3d.com
wrote:

 Hi,

 According to the Python interface docstring
 https://github.com/apache/mesos/blob/master/src/python/interface/src/mesos/interface/__init__.py#L184-L193,
 launchTasks() may be called with a set of tasks.

 In our framework, we thought this is used to issue a single RPC for
 launching many tasks onto many offers (potentially from many slaves), as an
 optimization (e.g., less communication overhead).

 But, when running with multiple slaves, we saw that tasks are lost when
 they are assigned to different slaves with the same launchTasks() call.

 Reading the docstring of launchTasks carefully, I still couldn't figure
 out that this is the intended behavior, so I'm here to verify that.
 If that's by design, it should be stated clearly in the docstring (I'd be
 happy to provide a documentation pull request for this).

 Now, if this *is* the intended behavior, it raises the question - why does
 launchTasks() support a set of tasks? doesn't mesos already aggregate
 resources from the same slave to a single offer?

 Thanks,
 - Itamar.



Re: Mesos cluster auto scaling slaves

2015-02-27 Thread Sharma Podila
Hello Kenneth,

There is a little bit of work needed in the framework to do autoscaling of
the slave cluster. Theoretically, scaling up can be relatively easy by
watching the utilization and adding nodes. However, in order to scale down,
the framework must support two things: some kind of bin packing, so that it
uses as few slaves as possible, and a way to determine which slaves can be
shut down.
I discussed how we achieve this at last year's MesosCon and also at AWS
re:Invent, slides from which are at
http://www.slideshare.net/spodila/aws-reinvent-2014-talk-scheduling-using-apache-mesos-in-the-cloud
 in case that helps you with ideas.
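To make the scale-down side concrete: once the scheduler bin-packs, idle hosts accumulate naturally, and picking termination candidates can be as simple as the sketch below. The data structures and the headroom parameter are assumptions for illustration, not anything Mesos itself provides.

```python
def scale_down_candidates(slaves, min_idle=2):
    """Pick slaves that can be terminated: hosts with zero running tasks,
    keeping at least `min_idle` idle hosts as headroom for new work.
    `slaves` maps hostname -> number of running tasks. Assumes the
    scheduler bin-packs, so idle hosts cluster together."""
    idle = sorted(h for h, n in slaves.items() if n == 0)
    return idle[min_idle:]  # terminate everything beyond the headroom
```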



On Fri, Feb 27, 2015 at 12:52 PM, Kenneth Su su.ke...@gmail.com wrote:

 Hi all,

 I am new to Mesos/Mesosphere. I have tried a test from the tutorials and
 successfully built up a single master with two slaves, and also dispatched
 tasks through Marathon to all slaves. It ran as expected, and it is great to
 be able to scale an app to as many instances as it needs.

 However, a question came up, and I tried to find related information on
 how Mesos could automatically scale the slaves as needed on the
 hardware/machines, but there seem to be few details on how that process
 works.

 Do we need another layer to watch and provision nodes on demand on the
 PaaS, so that new nodes can automatically join the Mesos cluster, or can
 Mesos also handle that kind of task?

 Appreciated if any of related information/documents.

 Thanks!
 Kenneth



Re: cluster wide init

2015-01-22 Thread Sharma Podila
Schedulers can only use resources on slaves that are unused by and
unallocated to other schedulers. Therefore, schedulers cannot achieve this
unless you reserve slots on every slave for the scheduler. That seems like
a forced fit. Init-like support would be more fundamental to the Mesos
cluster itself, if available.


On Thu, Jan 22, 2015 at 10:08 AM, Ryan Thomas r.n.tho...@gmail.com wrote:

 This seems more like the responsibility of the scheduler that is running,
 like marathon or aurora.

 I haven't tried it, but I would imagine that if you had 10 slaves and
 started a job with 11 tasks with host exclusivity, then when you spin up
 an 11th slave, Marathon would start it there.


 On Thursday, 22 January 2015, Sharma Podila spod...@netflix.com wrote:

 Just a thought looking forward...
 Might be useful to define an init kind of feature in Mesos slaves.
 Configuration can be defined in Mesos master that lists services that must
 be run on all slaves. When slaves register, they get the list of services
 to run all the time. Updates to the configuration can be dynamically
 reflected on all slaves and therefore this ensures that all slaves run the
 required services. Sophistication can be put in place to have different set
 of services for different types of slaves (by resource types/quantity,
 etc.).
 Such a feature bodes well with Mesos being the DataCenter OS/Kernel.


 On Thu, Jan 22, 2015 at 9:43 AM, CCAAT cc...@tampabay.rr.com wrote:

 On 01/21/2015 11:10 PM, Shuai Lin wrote:

 OK, I'll take a look at the debian package.

 thanks,
 James




  You can always write the init wrapper scripts for marathon. There is an
 official debian package, which you can find in mesos's apt repo.

 On Thu, Jan 22, 2015 at 4:20 AM, CCAAT cc...@tampabay.rr.com
 mailto:cc...@tampabay.rr.com wrote:

 Hello all,

 I was reading about Marathon: Marathon scheduler processes were
 started outside of Mesos using init, upstart, or a similar tool [1]

 This means

 So my related questions are

 Does Marathon work with mesos + Openrc as the init system?

 Are there any other frameworks that work with Mesos + Openrc?


 James



 [1] http://mesosphere.github.io/__marathon/
 http://mesosphere.github.io/marathon/







Re: cluster wide init

2015-01-22 Thread Sharma Podila
Just a thought looking forward...
Might be useful to define an init kind of feature in Mesos slaves.
Configuration can be defined in Mesos master that lists services that must
be run on all slaves. When slaves register, they get the list of services
to run all the time. Updates to the configuration can be dynamically
reflected on all slaves and therefore this ensures that all slaves run the
required services. Sophistication can be put in place to have different set
of services for different types of slaves (by resource types/quantity,
etc.).
Such a feature bodes well with Mesos being the DataCenter OS/Kernel.
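The convergence step of such an init-like feature could look like the sketch below, where the master-configured service list is diffed against what a slave currently runs. These structures are hypothetical; nothing like this exists in Mesos today.

```python
def reconcile_services(required, running):
    """Given the master-configured set of `required` services and the set
    currently `running` on a slave, compute what the slave should start
    and stop to converge on the configuration. Returned lists are sorted
    so the result is deterministic."""
    required, running = set(required), set(running)
    return {"start": sorted(required - running),
            "stop": sorted(running - required)}
```

A slave would apply this diff on registration and whenever the master pushes a configuration update, which gives the dynamic behavior described above.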


On Thu, Jan 22, 2015 at 9:43 AM, CCAAT cc...@tampabay.rr.com wrote:

 On 01/21/2015 11:10 PM, Shuai Lin wrote:

 OK, I'll take a look at the debian package.

 thanks,
 James




  You can always write the init wrapper scripts for marathon. There is an
 official debian package, which you can find in mesos's apt repo.

 On Thu, Jan 22, 2015 at 4:20 AM, CCAAT cc...@tampabay.rr.com
 mailto:cc...@tampabay.rr.com wrote:

 Hello all,

 I was reading about Marathon: Marathon scheduler processes were
 started outside of Mesos using init, upstart, or a similar tool [1]

 This means

 So my related questions are

 Does Marathon work with mesos + Openrc as the init system?

 Are there any other frameworks that work with Mesos + Openrc?


 James



 [1] http://mesosphere.github.io/__marathon/
 http://mesosphere.github.io/marathon/






Re: Trying to debug an issue in mesos task tracking

2015-01-21 Thread Sharma Podila
Have you checked the mesos-slave and mesos-master logs for that task id?
There should be logs in there for task state updates, including FINISHED.
There can be specific cases where the task status is not reliably
sent to your scheduler (due to mesos-master restarts, leader election
changes, etc.). There is task reconciliation support in Mesos: a periodic
call to reconcile tasks from the scheduler can be helpful, and there are
also newer enhancements coming to task reconciliation. In the meantime,
there are other strategies, such as the one I use: periodic heartbeats
from my custom executor to my scheduler (out of band). Timeouts on task
runtimes are similar to heartbeats, except that you need a priori
knowledge of all tasks' runtimes.

Task runtime limits are not supported inherently, as far as I know. Your
executor can implement them, and that may be one simple way to do it. That
could also be a good way to implement the shell's rlimit*, in general.
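A scheduler-side sketch of the heartbeat/timeout bookkeeping described above. The class and method names are illustrative, not a Mesos API; on expiry, the scheduler would reconcile or kill the flagged tasks rather than let them sit in Running forever.

```python
import time

class TaskWatchdog:
    """Track a deadline per task: an initial allowance when the task is
    launched, extended on each (out-of-band) heartbeat, and cleared on a
    terminal status update. Tasks past their deadline are candidates for
    reconciliation or killing."""

    def __init__(self):
        self._deadline = {}  # task_id -> epoch seconds

    def track(self, task_id, max_runtime_secs, now=None):
        now = time.time() if now is None else now
        self._deadline[task_id] = now + max_runtime_secs

    def heartbeat(self, task_id, extend_secs, now=None):
        now = time.time() if now is None else now
        self._deadline[task_id] = now + extend_secs

    def expired(self, now=None):
        now = time.time() if now is None else now
        return sorted(t for t, dl in self._deadline.items() if dl < now)

    def forget(self, task_id):
        # Call on TASK_FINISHED / TASK_FAILED / TASK_LOST etc.
        self._deadline.pop(task_id, None)
```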



On Wed, Jan 21, 2015 at 1:22 AM, Itamar Ostricher ita...@yowza3d.com
wrote:

 I'm using a custom internal framework, loosely based on MesosSubmit.
 The phenomenon I'm seeing is something like this:
 1. Task X is assigned to slave S.
 2. I know this task should run for ~10minutes.
 3. On the master dashboard, I see that task X is in the Running state
 for several *hours*.
 4. I SSH into slave S, and see that task X is *not* running. According to
 the local logs on that slave, task X finished a long time ago, and seemed
 to finish OK.
 5. According to the scheduler logs, it never got any update from task X
 after the Staging->Running update.

 The phenomenon occurs pretty often, but it's not consistent or
 deterministic.

 I'd appreciate your input on how to go about debugging it, and/or
 implement a workaround to avoid wasted resources.

 I'm pretty sure the executor on the slave sends the TASK_FINISHED status
 update (how can I verify that beyond my own logging?).
 I'm pretty sure the scheduler never receives that update (again, how can I
 verify that beyond my own logging?).
 I have no idea if the master got the update and passed it through (how can
 I check that?).
 My scheduler and executor are written in Python.

 As for a workaround - setting a timeout on a task should do the trick. I
 did not see any timeout field in the TaskInfo message. Does mesos support
 the concept of per-task timeouts? Or should I implement my own task
 tracking and timeout mechanism in the scheduler?



Re: implementing data locality via mesos resource offers

2015-01-16 Thread Sharma Podila
Using the attributes would be the simplest way, if the slave were to
support dynamic updates of the attributes. The JIRA that Tim references
would be nice! Otherwise one would have to resort to something like a
wrapper script of the mesos-slave process that detects new data
availability and restarts mesos-slave with new attributes in cmdline.
Restarts may be OK when slaves are run to checkpoint state and recover
state upon restart.
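The wrapper-script workaround might build the slave command line from a locally maintained attributes file, along these lines. The file path and base flags here are made up for illustration; --attributes is a real mesos-slave flag, but everything else is an assumption.

```python
def slave_command(attr_file_contents,
                  base_flags=("--master=zk://localhost:2181/mesos",)):
    """Build the mesos-slave command line for a wrapper script that is
    re-run whenever local data availability changes. `attr_file_contents`
    uses mesos attribute syntax, e.g. "dataset1:true;dataset2:false"."""
    cmd = ["mesos-slave"] + list(base_flags)
    attrs = attr_file_contents.strip()
    if attrs:
        cmd.append("--attributes=%s" % attrs)
    return cmd
```

The wrapper would watch for data arrival, rewrite the attributes file, and exec the rebuilt command, relying on slave checkpointing to recover running tasks across the restart.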

Another possibility in the interim would be for the framework scheduler to
launch the task that does the download of the file(s) to the small subset
of nodes. Then, the scheduler can maintain this state information and
assign the tasks based on that. This has the additional advantage of
maintaining the list of that subset of nodes in a more dynamic way, if that
is useful to you.

In general, I am a fan of achieving data locality via the scheduler's state
info. In a more generic scenario, the data would be created dynamically by
tasks previously run (instead of just an initial download) and therefore
locality for such data is easier done via the scheduler.
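A minimal sketch of achieving locality through scheduler-held state, assuming the scheduler records which hosts hold each dataset and matches pending tasks against offers accordingly. Plain dicts stand in for the Mesos offer/task objects.

```python
def assign_with_locality(pending_tasks, offers, data_hosts):
    """Match tasks to offers using scheduler-held locality state.
    `data_hosts` maps dataset name -> set of hostnames that have already
    downloaded (or produced) it; a task with a "dataset" key only matches
    offers from those hosts, while other tasks take any unused offer.
    Returns a list of (task_id, offer_id) assignments."""
    assignments, used = [], set()
    for task in pending_tasks:
        wanted = data_hosts.get(task.get("dataset"))
        for offer in offers:
            if offer["id"] in used:
                continue
            if wanted is not None and offer["hostname"] not in wanted:
                continue
            assignments.append((task["id"], offer["id"]))
            used.add(offer["id"])
            break
    return assignments
```

The scheduler updates `data_hosts` itself whenever a download or data-producing task completes, which is what keeps this approach dynamic without any slave-side changes.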



On Fri, Jan 16, 2015 at 12:15 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Douglas,

 The simplest way that Mesos can support is to add attributes via cli flags
 when you launch a mesos slave. And when this slave's resources is being
 offered, it will also include all the attributes you've tagged.

 This currently is static information on launch, and I believe there is
 JIRA tickets to make this dynamic (updatable at runtime).

 Tim

 On Thu, Jan 15, 2015 at 7:23 PM, Douglas Voet dv...@broadinstitute.org
 wrote:

 Hello,

 I am evaluating mesos in the context of running analyses of many large
 files. I only want to download a file to a small subset of my nodes and
 route the related processing there. The mesos paper talks about using
 resource offers as a mechanism to achieve data locality but I can't find
 any reference to how one might do this in the documentation. How would a
 mesos slave know what data is available keeping in mind that that might
 change over time? How can I configure a slave to include this information
 in resource offers?

 Thanks in advance for any pointers.

 -Doug





Re: implementing data locality via mesos resource offers

2015-01-16 Thread Sharma Podila
Hi Tim,

Sure, here's some preliminary thoughts.

In a Mesos cluster that has only one framework, it would suffice for the
scheduler to have this strategy:

- when assigning a task that needs data locality, assign from an offer
from a host that has the data
- when assigning a task that does not need data locality, do not assign
from an offer from a host that has/had another task which produced data
needed by others for data locality

This strategy would naturally cluster hosts into two groups: one in which
hosts are used for data locality and another in which hosts run tasks that
don't need data locality. Or, multiple groups if not all data is identical.

Now, if there were to be multiple frameworks in the cluster, we would need
new support in Mesos to ensure the above strategy works. The Mesos allocator
would need to do the following:

- when giving out offers to framework A, prefer hosts that had other tasks
running (or previously run) from framework A.

As an example, say we have two frameworks A and B. And say there are 4
hosts, h1, h2, h3, and h4, each with 4 cores.
If, say, A and B are assigned 1:1, that is 8 cores each. Say currently, 2
cores from each of the 4 hosts are offered to frameworks A and B. A variety
of reasons could have resulted in such a split.

Now, say framework A launches a task that uses 2 cores and it uses its
offer on host h1. Now, framework A has no ability to launch another task to
achieve data locality. To keep resource allocation still 1:1 and help data
locality, it would be nice if Mesos did the following:

- rescind 2-core offer on h1 from framework B
- rescind 2-core offer on h2 from framework A
- send 2-core offer on h1 to framework A
- send 2-core offer on h2 to framework B

This would need to be done only if framework A indicated, when launching
its task on h1, that this is a task that produces data for locality
purposes.

Similarly, other scenarios and other resource types can be dealt with in
this new strategy.
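The rescind/re-offer swap from the example can be expressed as a small computation over who currently holds which offers. These structures are hypothetical; Mesos has no such allocator hook today, so this is only a sketch of the proposed behavior.

```python
def locality_swap(producer_fw, data_host, offers):
    """After `producer_fw` launches a data-producing task on `data_host`,
    find another framework holding an offer on that host and a host where
    `producer_fw` holds an equal-sized offer, then emit the rescind/offer
    steps that swap them, keeping each framework's share constant.
    `offers` maps framework -> {host: cores}."""
    for other_fw, held in offers.items():
        if other_fw == producer_fw or data_host not in held:
            continue
        cores = held[data_host]
        for mine_host, mine_cores in offers[producer_fw].items():
            if mine_cores == cores and mine_host != data_host:
                return [("rescind", other_fw, data_host, cores),
                        ("rescind", producer_fw, mine_host, cores),
                        ("offer", producer_fw, data_host, cores),
                        ("offer", other_fw, mine_host, cores)]
    return []  # no swap possible; shares stay as they are
```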





On Fri, Jan 16, 2015 at 9:53 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Sharma,

 You're correct and that's how most schedulers handle this, which is to
 handle the locality information itself.

 We've considering and finding primitives to help in this front though, so
 if you have any input let us know how to help manage locality that fits at
 the level of Mesos.

 Tim

 On Fri, Jan 16, 2015 at 9:34 AM, Sharma Podila spod...@netflix.com
 wrote:

 Using the attributes would be the simplest way, if the slave were to
 support dynamic updates of the attributes. The JIRA that Tim references
 would be nice! Otherwise one would have to resort to something like a
 wrapper script of the mesos-slave process that detects new data
 availability and restarts mesos-slave with new attributes in cmdline.
 Restarts may be OK when slaves are run to checkpoint state and recover
 state upon restart.

 Another possibility in the interim would be for the framework scheduler
 to launch the task that does the download of the file(s) to the small
 subset of nodes. Then, the scheduler can maintain this state information
 and assign the tasks based on that. This has the additional advantage of
 maintaining the list of that subset of nodes in a more dynamic way, if that
 is useful to you.

 In general, I am a fan of achieving data locality via the scheduler's
 state info. In a more generic scenario, the data would be created
 dynamically by tasks previously run (instead of just an initial download)
 and therefore locality for such data is easier done via the scheduler.



 On Fri, Jan 16, 2015 at 12:15 AM, Tim Chen t...@mesosphere.io wrote:

 Hi Douglas,

 The simplest way that Mesos can support is to add attributes via cli
 flags when you launch a mesos slave. And when this slave's resources is
 being offered, it will also include all the attributes you've tagged.

 This currently is static information on launch, and I believe there is
 JIRA tickets to make this dynamic (updatable at runtime).

 Tim

 On Thu, Jan 15, 2015 at 7:23 PM, Douglas Voet dv...@broadinstitute.org
 wrote:

 Hello,

 I am evaluating mesos in the context of running analyses of many large
 files. I only want to download a file to a small subset of my nodes and
 route the related processing there. The mesos paper talks about using
 resource offers as a mechanism to achieve data locality but I can't find
 any reference to how one might do this in the documentation. How would a
 mesos slave know what data is available keeping in mind that that might
 change over time? How can I configure a slave to include this information
 in resource offers?

 Thanks in advance for any pointers.

 -Doug







Re: Question about External Containerizer

2014-12-03 Thread Sharma Podila
This may have to do with fine-grain Vs coarse-grain resource allocation.
Things may be easier for you, Diptanu, if you are using one Docker
container per task (sort of coarse grain). In that case, I believe there's
no need to alter a running Docker container's resources. Instead, the
resource update of your executor translates into the right Docker
containers running. There are some details to be worked out there, I am sure.
It sounds like Tom's strategy uses the same Docker container for multiple
tasks. Tom, do correct me otherwise.
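For the coarse-grain case, the slave-side arithmetic being discussed reduces to recomputing the executor container's limits as the executor's own overhead plus the sum over its currently running tasks. A sketch with simple {name: amount} dict resources standing in for the protobuf Resources type:

```python
def container_limits(executor_resources, running_tasks):
    """Recompute an executor container's resource limits as the executor
    overhead plus the sum over currently running tasks' resources,
    mirroring how limits get adjusted when tasks launch or finish on an
    already-running executor."""
    totals = dict(executor_resources)
    for task in running_tasks:
        for name, amount in task.items():
            totals[name] = totals.get(name, 0) + amount
    return totals
```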

On Wed, Dec 3, 2014 at 3:38 AM, Tom Arnfeld t...@duedil.com wrote:

 When Mesos is asked to launch a task (with either a custom Executor or
 the built in CommandExecutor) it will first spawn the executor which _has_
 to be a system process, launched via command. This process will be launched
 inside of a Docker container when using the previously mentioned
 containerizers.

 Once the Executor registers with the slave, the slave will send it a
 number of *launchTask* calls based on the number of tasks queued up for
 that executor. The Executor can then do as it pleases with those tasks,
 whether it's just a *sleep(1)* or to spawn a subprocess and do some other
 work. Given it is possible for the framework to specify resources for both
 tasks and executors, and the only thing which _has_ to be a system
 process is the executor, the mesos slave will limit the resources of the
 executor process to the sum of (TaskInfo.Executor.Resources +
 TaskInfo.Resources).

 Mesos also has the ability to launch new tasks on an already running
 executor, so it's important that mesos is able to dynamically scale the
 resource limits up and down over time. Designing a framework around this
 idea can lead to some complex and powerful workflows which would be a lot
 more complex to build without Mesos.

 Just for an example... Spark.

 1) User launches a job on spark to map over some data
 2) Spark launches a first wave of tasks based on the offers it received
 (let's say T1 and T2)
 3) Mesos launches executors for those tasks (let's say E1 and E2) on
 different slaves
 4) Spark launches another wave of tasks based on offers, and tells mesos
 to use the same executor (E1 and E2)
 5) Mesos will simply call *launchTasks(T{3,4})* on the two already
 running executors

 At point (3) mesos is going to launch a Docker container and execute your
 executor. However at (5) the executor is already running so the tasks will
 be handed to the already running executor.

 Mesos will guarantee you (I'm 99% sure) that the resources for your
 container have been updated to reflect the limits set on the tasks
 *before* handing the tasks to you.

 I hope that makes some sense!

 --

 Tom Arnfeld
 Developer // DueDil


 On Wed, Dec 3, 2014 at 10:54 AM, Diptanu Choudhury dipta...@gmail.com
 wrote:

 Thanks for the explanation Tom, yeah I just figured that out by reading
 your code! You're touching the memory.soft_limit_in_bytes and
 memory.limit_in_bytes directly.

 Still curious to understand in which situations Mesos Slave would call the
 external containerizer to update the resource limits of a container? My
 understanding was that once resource allocation happens for a task,
 resources are not taken away until the task exits[fails, crashes or
 finishes] or Mesos asks the slave to kill the task.

 On Wed, Dec 3, 2014 at 2:47 AM, Tom Arnfeld t...@duedil.com wrote:

 Hi Diptanu,

 That's correct, the ECP has the responsibility of updating the resources
 for a container, and it will do so as new tasks are launched and killed for an
 executor. Since docker doesn't support this, our containerizer (Deimos does
 the same) goes behind docker to the cgroup for the container and updates
 the resources in a very similar way to the mesos-slave. I believe this is
 also what the built in Docker containerizer will do.


 https://github.com/duedil-ltd/mesos-docker-containerizer/blob/master/containerizer/commands/update.py#L35

 Tom.

 --

 Tom Arnfeld
 Developer // DueDil


 On Wed, Dec 3, 2014 at 10:45 AM, Diptanu Choudhury dipta...@gmail.com
 wrote:

 Hi,

 I had a quick question about the external containerizer. I see that
 once the Task is launched, the ECP can receive the update calls, and the
 protobuf message passed to ECP with the update call is
 containerizer::Update.

 This protobuf has a Resources [list] field so does that mean Mesos
 might ask a running task to re-adjust the enforced resource limits?

 How would this work if the ECP was launching docker containers because
 Docker doesn't allow changing the resource limits once the container has
 been started?

 I am wondering how does Deimos and mesos-docker-containerizer handle
 this.

 --
 Thanks,
 Diptanu Choudhury
 Web - www.linkedin.com/in/diptanu
 Twitter - @diptanu http://twitter.com/diptanu





 --
 Thanks,
 Diptanu Choudhury
 Web - www.linkedin.com/in/diptanu
 Twitter - @diptanu http://twitter.com/diptanu





Re: Question about External Containerizer

2014-12-03 Thread Sharma Podila
Yes, although there's a nuance to this specific situation. Here, the same
executor is being used for multiple tasks, but, the executor is launching a
different Docker container for each task. I was extending the coarse grain
allocation concept to within the executor (which is in a fine grained
allocation model).
What you mention, we do use already for a different framework, not the one
Diptanu is talking about.

On Wed, Dec 3, 2014 at 11:04 AM, Connor Doyle con...@mesosphere.io wrote:

 You're right Sharma, it's dependent upon the framework.  If your scheduler
 sets a unique ExecutorID for each TaskInfo, then the executor will not be
 re-used and you won't have to worry about resizing the executor's container
 to accomodate subsequent tasks.  This might be a reasonable simplification
 to start with, especially if your executor adds relatively low resource
 overhead.
 --
 Connor


  On Dec 3, 2014, at 10:20, Sharma Podila spod...@netflix.com wrote:
 
  This may have to do with fine-grain Vs coarse-grain resource allocation.
 Things may be easier for you, Diptanu, if you are using one Docker
 container per task (sort of coarse grain). In that case, I believe there's
 no need to alter a running Docker container's resources. Instead, the
 resource update of your executor translates into the right Docker
 containers running. There's some details to be worked out there, I am sure.
  It sounds like Tom's strategy uses the same Docker container for
 multiple tasks. Tom, do correct me otherwise.
 
  On Wed, Dec 3, 2014 at 3:38 AM, Tom Arnfeld t...@duedil.com wrote:
  When Mesos is asked to launch a task (with either a custom Executor or
 the built in CommandExecutor) it will first spawn the executor which _has_
 to be a system process, launched via command. This process will be launched
 inside of a Docker container when using the previously mentioned
 containerizers.
 
  Once the Executor registers with the slave, the slave will send it a
 number of launchTask calls based on the number of tasks queued up for that
 executor. The Executor can then do as it pleases with those tasks, whether
 it's just a sleep(1) or to spawn a subprocess and do some other work. Given
 it is possible for the framework to specify resources for both tasks and
 executors, and the only thing which _has_ to be a system process is the
 executor, the mesos slave will limit the resources of the executor process
 to the sum of (TaskInfo.Executor.Resources + TaskInfo.Resources).
 
  Mesos also has the ability to launch new tasks on an already running
 executor, so it's important that mesos is able to dynamically scale the
 resource limits up and down over time. Designing a framework around this
 idea can lead to some complex and powerful workflows which would be a lot
 more complex to build without Mesos.
 
  Just for an example... Spark.
 
  1) User launches a job on spark to map over some data
  2) Spark launches a first wave of tasks based on the offers it received
 (let's say T1 and T2)
  3) Mesos launches executors for those tasks (let's say E1 and E2) on
 different slaves
  4) Spark launches another wave of tasks based on offers, and tells mesos
 to use the same executor (E1 and E2)
  5) Mesos will simply call launchTasks(T{3,4}) on the two already running
 executors
 
  At point (3) mesos is going to launch a Docker container and execute
 your executor. However at (5) the executor is already running so the tasks
 will be handed to the already running executor.
 
  Mesos will guarantee you (I'm 99% sure) that the resources for your
 container have been updated to reflect the limits set on the tasks before
 handing the tasks to you.
 
  I hope that makes some sense!
 
  --
 
  Tom Arnfeld
  Developer // DueDil
 
 
  On Wed, Dec 3, 2014 at 10:54 AM, Diptanu Choudhury dipta...@gmail.com
 wrote:
 
  Thanks for the explanation Tom, yeah I just figured that out by reading
 your code! You're touching the memory.soft_limit_in_bytes and
 memory.limit_in_bytes directly.
 
  Still curious to understand in which situations Mesos Slave would call
 the external containerizer to update the resource limits of a container? My
 understanding was that once resource allocation happens for a task,
 resources are not taken away until the task exits[fails, crashes or
 finishes] or Mesos asks the slave to kill the task.
 
  On Wed, Dec 3, 2014 at 2:47 AM, Tom Arnfeld t...@duedil.com wrote:
  Hi Diptanu,
 
  That's correct, the ECP has the responsibility of updating the resources
 for a container, and it will do so as new tasks are launched and killed for an
 executor. Since docker doesn't support this, our containerizer (Deimos does
 the same) goes behind docker to the cgroup for the container and updates
 the resources in a very similar way to the mesos-slave. I believe this is
 also what the built in Docker containerizer will do.
 
 
 https://github.com/duedil-ltd/mesos-docker-containerizer/blob/master/containerizer/commands/update.py#L35
 
  Tom.
 
  --
 
  Tom

Re: A problem with resource offers

2014-11-07 Thread Sharma Podila
Thanks, Adam. I should've looked at the fixed issues for this.
Things work fine with a later version, confirmed with 0.20.

On Fri, Nov 7, 2014 at 1:29 AM, Adam Bordelon a...@mesosphere.io wrote:

 Fixed in 0.19: https://issues.apache.org/jira/browse/MESOS-1400

 On Thu, Nov 6, 2014 at 7:59 PM, Timothy Chen t...@mesosphere.io wrote:

 Hi Sharma,

 Can you try out the latest master and see if you can repro it?

 Tim

 Sent from my iPhone


Re: A problem with resource offers

2014-11-06 Thread Sharma Podila
I am on 0.18 still.

I think I found a bug. I wrote a simple program to repeat this and there's
a new twist as well.

Again, although I have fixed this for now in my framework by removing all
previous leases after re-registration, this can show up when mesos starts
rescinding offers in the future.

Here's what I do:

1. register with mesos that has just one slave in the cluster and only one
master
2. get an offer, O1
3. kill and restart mesos master
4. get new offer for the only slave, O2
5. launch a task with both offers O1 and O2
6. receive TASK_LOST
7. wait for new offer, that never comes.
Here's the new twist:
8. kill my framework and restart
9. get no offers from mesos at all.

Here's the relevant mesos master logs:

I1106 19:31:55.734485 10423 master.cpp:770] Elected as the leading master!
I1106 19:31:55.737759 10423 master.cpp:1936] Attempting to re-register
slave 20141029-125131-16842879-5050-18827-1 at slave(1)@127.0.1.1:5051
(lgud-spodila2)
I1106 19:31:55.737788 10423 master.cpp:2818] Adding slave
20141029-125131-16842879-5050-18827-1 at lgud-spodila2 with cpus(*):8;
mem(*):39209; disk(*):219127; ports(*):[31000-32000]
I1106 19:31:55.738088 10422 hierarchical_allocator_process.hpp:445] Added
slave 20141029-125131-16842879-5050-18827-1 (lgud-spodila2) with cpus(*):8;
mem(*):39209; disk(*):219127; ports(*):[31000-32000] (and cpus(*):8;
mem(*):39209; disk(*):219127; ports(*):[31000-32000] available)
I1106 19:31:56.733850 10423 master.cpp:906] Re-registering framework
20141106-193136-16842879-5050-10308- at scheduler(1)@127.0.1.1:55515
I1106 19:31:56.734544 10424 hierarchical_allocator_process.hpp:332] Added
framework 20141106-193136-16842879-5050-10308-
I1106 19:31:56.735044 10424 master.cpp:2285] Sending 1 offers to framework
20141106-193136-16842879-5050-10308-
I1106 19:31:59.627913 10423 http.cpp:391] HTTP request for
'/master/state.json'
I1106 19:32:09.634088 10421 http.cpp:391] HTTP request for
'/master/state.json'
W1106 19:32:10.377226 10425 master.cpp:1556] Failed to validate offer  :
Offer 20141106-193136-16842879-5050-10308-0 is no longer valid
I1106 19:32:10.378697 10425 master.cpp:1567] Sending status update
TASK_LOST (UUID: afadf504-f606-47f2-82cc-5af2e532afcd) for task Job123 of
framework 20141106-193136-16842879-5050-10308- for launch task attempt
on invalid offers: [ 20141106-193147-16842879-5050-10406-0,
20141106-193136-16842879-5050-10308-0 ]


Master thinks both offers are invalid and basically leaks them.

I1106 19:32:19.640913 10422 http.cpp:391] HTTP request for
'/master/state.json'
I1106 19:32:22.667037 10424 master.cpp:595] Framework
20141106-193136-16842879-5050-10308- disconnected
I1106 19:32:22.667280 10424 master.cpp:1079] Deactivating framework
20141106-193136-16842879-5050-10308-
I1106 19:32:22.668009 10424 master.cpp:617] Giving framework
20141106-193136-16842879-5050-10308- 0ns to failover
I1106 19:32:22.668124 10427 hierarchical_allocator_process.hpp:408]
Deactivated framework 20141106-193136-16842879-5050-10308-
I1106 19:32:22.668252 10425 master.cpp:2201] Framework failover timeout,
removing framework 20141106-193136-16842879-5050-10308-
I1106 19:32:22.668443 10425 master.cpp:2688] Removing framework
20141106-193136-16842879-5050-10308-
I1106 19:32:22.668829 10425 hierarchical_allocator_process.hpp:363] Removed
framework 20141106-193136-16842879-5050-10308-
I1106 19:32:24.739157 10426 master.cpp:818] Received registration request
from scheduler(1)@127.0.1.1:37122
I1106 19:32:24.739328 10426 master.cpp:836] Registering framework
20141106-193147-16842879-5050-10406- at scheduler(1)@127.0.1.1:37122
I1106 19:32:24.739753 10426 hierarchical_allocator_process.hpp:332] Added
framework 20141106-193147-16842879-5050-10406-
I1106 19:32:29.647886 10423 http.cpp:391] HTTP request for
'/master/state.json'


On Thu, Nov 6, 2014 at 6:53 PM, Benjamin Mahler benjamin.mah...@gmail.com
wrote:

 Which version of the master are you using and do you have the logs? The
 fact that no offers were coming back sounds like a bug!

 As for using O1 after a disconnection, all offers are invalid once a
 disconnection occurs. The scheduler driver does not automatically rescind
 offers upon disconnection, so I'd recommend clearing all cached offers when
 your scheduler gets disconnected, to avoid the unnecessary TASK_LOST
 updates.

 On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila spod...@netflix.com wrote:

 We had an interesting problem with resource offers today and I would like
 to confirm this problem and request an enhancement. Here's the summary in
 the right sequence of events:

 1. resource offer O1 for slave A arrives
 2. mesos disconnects
 3. mesos reregisters
 4. mesos offer O2 for slave A arrives
 (our framework keeps offers for some time if unused; therefore, we now
 have both O1 and O2, incorrectly)
 5. launch task T1 using offers O1 and O2
 6. framework thinks

Re: Reconciliation Document

2014-11-03 Thread Sharma Podila
Inline...

On Tue, Oct 21, 2014 at 12:52 PM, Benjamin Mahler benjamin.mah...@gmail.com
 wrote:

 Inline.

 On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila spod...@netflix.com
 wrote:

 Response inline, below.

 On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler 
 benjamin.mah...@gmail.com wrote:

 Thanks for the thoughtful questions, I will take these into account in
 the document.

 Addressing each question in order:

 *(1) Why the retry?*

 It could be once per (re-)registration in the future.

 Some requests are temporarily unanswerable. For example, if reconciling
 task T on slave S, and slave S has not yet re-registered, we cannot reply
 until the slave is re-registered or removed. Also, if a slave is
 transitioning (being removed), we want to make sure that operation finishes
 before we can answer.

 It's possible to keep the request around and trigger an event once we
 can answer. However, we chose to drop and remain silent for these tasks.
 This is both for implementation simplicity and as a defense against OOMing
 from too many pending reconciliation requests.


 I was thinking that the state machine that maintains the state of tasks
 always has answers for the current state. Therefore, I don't expect any
 blocking. For example, if S hasn't yet re-registered, the state machine
 must think that the state of T is still 'running' until either the slave
 re-registers and informs of the task being lost, or a timeout occurs after
 which the master decides the slave is gone. At that point a new status update
 can be sent. I don't see a reason why reconcile needs to wait until the slave
 re-registers here. Maybe I am missing something else? Same with
 transitioning... the state information is always available, say, as
 running, until the transition happens. This results in two status updates, but
 is always correct.


 Task state in Mesos is persisted in the leaves of the system (the slaves)
 for scalability reasons. So when a new master starts up, it doesn't know
 anything about tasks; this state is bootstrapped from the slaves as they
 re-register. This interim period of state recovery is when frameworks may
 not receive answers to reconciliation requests, depending on whether the
 particular slave has re-registered.

 In your second case, once a slave is removed, we will send the LOST update
 for all non-terminal tasks on the slave. There's little benefit of replying
 to a reconciliation request while it's being removed, because LOST updates
 are coming shortly thereafter. You can think of these LOST updates as the
 reply to the reconciliation request, as far as the scheduler is concerned.

 I think the two takeaways here are:

 (1) Ultimately while it is possible to avoid the need for retries on the
 framework side, it introduces too much complexity in the master and gives
 us no flexibility in ignoring or dropping messages. Even in such a world,
 the retries would be a valid resiliency measure for frameworks to insulate
 themselves against anything being dropped.

 (2) For now, we want to encourage framework developers to think about
 these kinds of issues, we want them to implement their frameworks in a
 resilient manner. And so in general we haven't chosen to provide a crutch
 when it requires a lot of complexity in Mesos. Today we can't add these
 ergonomic improvements in the scheduler driver because it has no
 persistence. Hopefully as the project moves forward, we can have these kind
 of framework side ergonomic improvements be contained in pure language
 bindings to Mesos. A nice stateful language binding can hide this from you.
 :)


​OK. The only thought I have is that it could be somewhat useful to have
master send back a (new) state of 'PendingSlaveUpdate' instead of going
silent. This way the reconcile process finishes immediately. Framework
would then retry ​later for tasks that got these states. Although, figuring
out the timeout after which to retry is still the same issue.

This brings up another question. Say, a slave is 'missing' and hasn't
re-registered with master yet. What is the expected behavior when framework
asks master to kill a task on that slave? Since the slave is disconnected,
the kill request isn't delivered to the executor on that slave. Is the
framework notified of this failure to send the kill request?

This has implications to a framework's task reconcile logic. After a
certain #reconciliations, framework would want to treat the task as
terminally lost and resubmit a replacement. For safety, I'd kill the
existing task before resubmitting the replacement. I am guessing frameworks
should not assume guaranteed delivery of the kill request. So, it is
possible that the task may continue running after the slave reconnects.
Which implies that the framework is now consuming double resources for the
same task. I understand this is out of scope for the master and the
tasks/frameworks should use external logic to guarantee only one instance
of a task runs. I am just wanting to know

Re: Reconciliation Document

2014-10-15 Thread Sharma Podila
Looks like a good step forward.

What is the reason for the algorithm having to call reconcile tasks
multiple times after waiting some time in step 6? Shouldn't it be just once
per (re)registration?

Are there time-bound guarantees within which a task update will be sent out
after a reconcile request is sent? In the algorithm for task
reconciliation, what would be a good timeout after which we conclude that
we got no task update from the master? Upon such a timeout, I would be
tempted to conclude that the task has disappeared. In which case, I would
call driver.killTask() (to be sure it's marked as gone), mark my task as
terminated, then submit a replacement task.
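The fallback described above can be sketched as a small watchdog: if no status update arrives within some timeout after a reconcile request, treat the task as lost, kill it defensively, and resubmit a replacement. The `TaskDriver` interface, `launchReplacement`, and the timeout value are illustrative stand-ins, not Mesos API.

```java
/**
 * Sketch of a framework-side reconciliation watchdog. A real driver
 * offers only best-effort killTask delivery, so this is defensive,
 * not a guarantee of single-instance execution.
 */
interface TaskDriver {
    void killTask(String taskId);          // best-effort; may not be delivered
    void launchReplacement(String taskId); // resubmit the work elsewhere
}

class ReconcileWatchdog {
    static final long TIMEOUT_MS = 60_000; // tunable; Mesos gives no hard bound

    private final TaskDriver driver;
    ReconcileWatchdog(TaskDriver driver) { this.driver = driver; }

    /** Returns true if the task was declared lost and replaced. */
    boolean onReconcileSilence(String taskId, long elapsedMs) {
        if (elapsedMs < TIMEOUT_MS) {
            return false;              // keep waiting / retry the reconcile
        }
        driver.killTask(taskId);       // be sure it's marked as gone
        driver.launchReplacement(taskId);
        return true;
    }
}
```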

Does the rate limiting feature (in the works?) affect task reconciliation
due to the volume of task updates sent back?

Thanks.


On Wed, Oct 15, 2014 at 2:05 PM, Benjamin Mahler benjamin.mah...@gmail.com
wrote:

 Hi all,

 I've sent a review out for a document describing reconciliation, you can
 see the draft here:
 https://gist.github.com/bmahler/18409fc4f052df43f403

 Would love to gather high level feedback on it from framework developers.
 Feel free to reply here, or on the review:
 https://reviews.apache.org/r/26669/

 Thanks!
 Ben



Re: Framework testing in Mesos

2014-10-14 Thread Sharma Podila

 @Sharma #3 looks impressive and I hear the pain. Few questions:
 * Since you already have the state machine modeling, can't the scheduler
 actions also be modeled as state machine transitions?


I suppose that is possible in theory. I am thinking that the scheduler
state will have to be a function of all the tasks' and slaves' states,
which could be more tedious to verify with every task assignment than to
validate individual assignment decisions. Maybe there is a different way to
look at this.


 * Having a spec for (in form of state machine or otherwise) scheduler
 looks important (and hard) goal. Mocking looks like a good thing. Is
 mocking general enough to become a library available to all, to enable
 *verifiably* correct scheduler behavior?


A general library for mocking parts of the scheduler may be useful, I
agree. Here's what I have right now. I mock the incoming offers with an
OfferProvider that has these methods:

getOffer(numCpus, memory, portRanges, attributesMap)  ## and overloaded
variants
getConsumedOffer(assignments)

The first is used to setup a new offer for a slave. When that slave gets
used for some task assignments, the second method returns a new offer that
has resources minus the resources used for the assignments.

This works for the task assignment part of the scheduler (#3 in my previous
email). Also, I don't build the actual Protos.Offer object since the task
assigner object I have deals with a wrapper object around the offer, which
is what I mock, strictly speaking.
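For illustration, a stripped-down sketch of such an OfferProvider mock. The `Offer` wrapper and its fields are stand-ins (the real code wraps Protos.Offer and also models ports and attributes); only the two method names are taken from the description above.

```java
/**
 * Minimal sketch of the OfferProvider mock described above. Only cpus
 * and memory are modeled; a fuller mock would carry port ranges and
 * an attributes map as in the overloaded getOffer variants.
 */
class Offer {
    final double cpus;
    final double memoryMb;
    Offer(double cpus, double memoryMb) {
        this.cpus = cpus;
        this.memoryMb = memoryMb;
    }
}

class OfferProvider {
    /** Fresh offer for a slave with the given resources. */
    Offer getOffer(double cpus, double memoryMb) {
        return new Offer(cpus, memoryMb);
    }

    /** Offer for the same slave after assignments consumed some resources. */
    Offer getConsumedOffer(Offer previous, double usedCpus, double usedMemoryMb) {
        return new Offer(previous.cpus - usedCpus, previous.memoryMb - usedMemoryMb);
    }
}
```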

Sharma


On Tue, Oct 14, 2014 at 9:36 AM, Dharmesh Kakadia dhkaka...@gmail.com
wrote:

 Thanks to both of you.

 @David Idempotence (and functional style) will both mitigate the issue of
 testing.

 @Sharma #3 looks impressive and I hear the pain. Few questions:
 * Since you already have the state machine modeling, can't the scheduler
 actions also be modeled as state machine transitions?
 * Having a spec for (in form of state machine or otherwise) scheduler
 looks important (and hard) goal. Mocking looks like a good thing. Is
 mocking general enough to become a library available to all, to enable
 *verifiably* correct scheduler behavior?

 Again thanks for sharing your thoughts.

 Thanks,
 Dharmesh

 On Mon, Oct 13, 2014 at 7:29 AM, David Greenberg dsg123456...@gmail.com
 wrote:

 Specifically with regards to the state of the framework due to callback
 ordering, we ensure that our framework is written in a functional style, so
 that all callbacks atomically transform the previous state to a new state.
 By doing this, we serialize all callbacks. At this point, you can do
 generative testing to create events and run them through your system. This,
 at least, makes #3 possible.

 For #4, we are pretty careful to choose idempotent writes into the DB and
 a DB that supports snapshot reads. This way, you can just use at-least-once
 semantics for easy-to-implement retries. If a write fails, you just crash,
 since that means your DB's completely down. Then we test by thinking
 through and discussing whether operations have this idempotency property
 and the simple retry logic independently. This starts to get at a way to
 manage #4 to avoid learning in production.
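The "functional style" described above can be sketched as a pure transition function: each callback produces a new immutable state from the previous one, so callbacks serialize naturally and generated events can be replayed in tests. The types and event shape here are illustrative.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/**
 * Sketch of functional callback handling: state is immutable, and each
 * event atomically transforms the previous state into a new one.
 */
final class FrameworkState {
    final Set<String> runningTasks;

    FrameworkState(Set<String> runningTasks) {
        this.runningTasks = Collections.unmodifiableSet(runningTasks);
    }

    /** Pure transition: returns a new state, never mutates the old one. */
    FrameworkState apply(String taskId, boolean terminal) {
        Set<String> next = new HashSet<String>(runningTasks);
        if (terminal) {
            next.remove(taskId);
        } else {
            next.add(taskId);
        }
        return new FrameworkState(next);
    }
}
```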

 On Sun, Oct 12, 2014 at 11:44 AM, Dharmesh Kakadia dhkaka...@gmail.com
 wrote:

 Thanks David.

 Taking the state of the framework is an interesting design. I am assuming
 the scheduler is maintaining the state and then handing tasks to slaves. If
 that's the case, we can safely test the executor (stateless - receiving an event
 and returning an appropriate status to the scheduler). You construct scheduler
 tests similarly by passing different states and events and observing the
 next state. This way you will be sure that your callbacks work fine in
 *isolation*. I would be concerned about the state of the framework in
 case of callback ordering (or re-execution) in *all possible scenarios*.
 Mocking is exactly what might uncover such bugs, but as you pointed out, I
 also think it would not be trivial for many frameworks.

 At a high-level it would be important to know for frameworks developers
 that,
 1. executors are working fine in isolation on fresh start, implementing
 the feature.
 2. executors are working fine when rescheduled/restarted/in presence of
 other executors.
 3. scheduler is working fine in isolation.
 4. scheduler is fine in the wild ( in presence of
 others/failures/checkpointing/...).

 1 is easy to do traditionally. 2 is possible if your executors do not
 have side effects or if you are using Docker, etc.
 3 and 4 are not easy to do. I think having support/library for testing
 scheduler is something all the framework writer would benefit from. Not
 having to think about communication between executors and scheduler is
 already a big plus, can we also make it easier for developers to test
 their scheduler behaviour?

 Thoughts?

 I would love to hear thoughts from others.

 Thanks,
 Dharmesh

 On Sun, Oct 12, 2014 at 8:03 PM, David Greenberg 

Re: Framework testing in Mesos

2014-10-12 Thread Sharma Podila
Trying to test the framework in an automated way, I tend to think of the
framework in these parts:
1. Executor
2. Scheduler's interaction with Mesos and state persistence
3. Scheduler's task assignment of resources

I will skip #1, you covered that already and it depends largely on the kind
of executor being used.

#2 is mostly achieved for us using a state machine along with a reliable
persistence engine. Then it comes down to testing the state machine, which
can be pretty simple. Besides the obvious state-transition rules testing,
we only add testing of alert generation/handling when certain state
transitions time out. For example, a task in STAGING state must transition to
STARTING state within a certain time, or an alert is generated.
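A minimal sketch of such a transition-timeout check; the states and the timeout value are illustrative, not taken from any real framework.

```java
/**
 * Sketch of alerting on a stuck state transition: a task that stays in
 * STAGING longer than its allowance should raise an alert.
 */
enum TaskState { STAGING, STARTING, RUNNING, FINISHED }

class TransitionMonitor {
    static final long STAGING_TIMEOUT_MS = 30_000; // illustrative allowance

    /** True if an alert should fire for a task stuck in STAGING. */
    boolean shouldAlert(TaskState state, long msInState) {
        return state == TaskState.STAGING && msInState > STAGING_TIMEOUT_MS;
    }
}
```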

#3 is where we have spent most of the time. This may not be necessary for
simpler assignment strategies such as first fit. We are doing a bit more
for optimal task assignments with hard/soft constraints, auto scaling,
etc. Trying to test a sophisticated scheduler can be non-trivial. But,
fortunately, it can be unit tested without requiring the rest of Mesos. Offers
can be mocked/created for testing, including all resources available, etc.
(we do this currently for CPU, memory, ports). Using offers to launch tasks
in Mesos can be mocked by generating new offers with the resources used by
launched tasks subtracted. As of now I have as many LoC in unit tests as the actual
code. Sometimes it takes less effort to write a new scheduler feature but
more effort to come up with deterministic tests for it. And far more effort
to debug it in real runs, if it weren't unit tested.

About #4 in your list, "scheduler is fine in the wild (in presence of
others/failures/checkpointing/...)", I'd call out ZooKeeper interaction as
well, since most likely there are multiple copies of the scheduler running
using a leader election strategy for HA purposes.

Happy to hear other strategies as well...

Sharma



On Sun, Oct 12, 2014 at 8:44 AM, Dharmesh Kakadia dhkaka...@gmail.com
wrote:

 Thanks David.

 Taking the state of the framework is an interesting design. I am assuming the
 scheduler is maintaining the state and then handing tasks to slaves. If
 that's the case, we can safely test the executor (stateless - receiving an event
 and returning an appropriate status to the scheduler). You construct scheduler
 tests similarly by passing different states and events and observing the
 next state. This way you will be sure that your callbacks work fine in
 *isolation*. I would be concerned about the state of the framework in
 case of callback ordering (or re-execution) in *all possible scenarios*.
 Mocking is exactly what might uncover such bugs, but as you pointed out, I
 also think it would not be trivial for many frameworks.

 At a high-level it would be important to know for frameworks developers
 that,
 1. executors are working fine in isolation on fresh start, implementing
 the feature.
 2. executors are working fine when rescheduled/restarted/in presence of
 other executors.
 3. scheduler is working fine in isolation.
 4. scheduler is fine in the wild ( in presence of
 others/failures/checkpointing/...).

 1 is easy to do traditionally. 2 is possible if your executors do not have
 side effects or if you are using Docker, etc.
 3 and 4 are not easy to do. I think having support/library for testing
 scheduler is something all the framework writer would benefit from. Not
 having to think about communication between executors and scheduler is
 already a big plus, can we also make it easier for developers to test
 their scheduler behaviour?

 Thoughts?

 I would love to hear thoughts from others.

 Thanks,
 Dharmesh

 On Sun, Oct 12, 2014 at 8:03 PM, David Greenberg dsg123456...@gmail.com
 wrote:

 For our frameworks, we don't tend to do much automated testing of the
 Mesos interface--instead, we construct the framework state, then send it a
 message, since our callbacks take the state of the framework + the event
 as the argument. This way, we don't need to have mesos running, and we can
 trim away large amounts of code necessary to connect to mesos but
 unnecessary for the actual feature under test. We've also been
 experimenting with simulation testing by mocking out the mesos APIs. These
 techniques are mostly effective when you can pretend that the executors
 you're using don't communicate much, or when they're trivial to mock.

 On Sun, Oct 12, 2014 at 9:42 AM, Dharmesh Kakadia dhkaka...@gmail.com
 wrote:

 Hi,

 I am working on a tiny experimental framework for Mesos. I was wondering
 what is the recommended way of writing testcases for framework testing. I
 looked at the several existing frameworks, but its still not clear to me. I
 understand that I might be able to test executor functionality in isolation
 through normal test cases, but testing as a whole framework is what I am
 unclear about.

 Suggestions? Is that a non-goal? How do other framework developers go
 about it?

 Also, on the related note, is there a way to debug frameworks 

Re: Design Review: Maintenance Primitives

2014-08-27 Thread Sharma Podila
Nicely written doc. Here's a few thoughts:

- There's some commonality between the existing offerRescinded() and the
new inverseOffer(). Maybe consider having the same method name for them with
differing signatures? I'd second Maxime's point about possibly renaming
inverseOffer to something else - maybe offerRescind() or offerRevoke()?

- Offer has hostname but InverseOffer doesn't, is that intentional?

- I like it that the operations of maintenance Vs. draining are separated
out. Draining deactivates the slave, and should also immediately rescind
all outstanding offers (I suppose using the offers would result in
TASK_LOST, but it would play nice with frameworks if offers are rescinded
proactively).

- For maintenance across all or a large part of the cluster, these
maintenance primitives would be helpful. Another piece required to
achieve fully automated maintenance (say, upgrading a kernel patch on all
slaves) would be a maintenance orchestration engine with constraints
such as ensuring that no more than X% of slaves of type A are down for
maintenance concurrently. That is, automated rolling upgrades with an SLA on
uptime/availability. Such an engine could accomplish its task using these
primitives.
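Such a constraint could be sketched as a simple gate that the orchestration engine consults before draining another slave; the X% threshold and slave-type grouping are illustrative.

```java
/**
 * Sketch of a rolling-maintenance availability gate: allow one more
 * slave of a given type to be drained only if the resulting number
 * down stays within the configured fraction.
 */
class MaintenanceGate {
    private final double maxDownFraction;

    MaintenanceGate(double maxDownFraction) {
        this.maxDownFraction = maxDownFraction;
    }

    /** May one more slave of this type be drained right now? */
    boolean mayDrain(int totalOfType, int alreadyDown) {
        return totalOfType > 0
            && (alreadyDown + 1) <= (int) Math.floor(maxDownFraction * totalOfType);
    }
}
```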






On Tue, Aug 26, 2014 at 2:23 PM, Maxime Brugidou maxime.brugi...@gmail.com
wrote:

 Glad to see that you are really thinking this through.

 Yes it's explicit that resources won't be revoked and will stay
 outstanding in this case but I would just add that the slave won't enter
 the drained state. It's just hard to follow the
 drain/revoke/outstanding/inverse offer/reclaim vocabulary. Actually, did
 you also think about the name? Inverse offer sounds weird to me. Maybe
 resourceOffers()  and resource Revoke()? You probably have better arguments
 and idea than me though :)

 Another small note: the OfferID in the inverse offer is completely new and
 just used to identify the inverse offer right? I got a bit confused about a
 link between a previous offerID and this but then I saw the Resource field.
 Wouldn't it be clearer to have InverseOfferID?

 Thanks for the work! I really want to have these primitives.
 On Aug 26, 2014 10:59 PM, Benjamin Mahler benjamin.mah...@gmail.com
 wrote:

 You're right, we don't account for that in the current design because
 such a framework would be relying on disk resources outside of the sandbox.
 Currently, we don't have a model for these persistent resources (e.g.
 disk volume used for HDFS DataNode data). Unlike the existing resources,
 persistent resources will not be tied to the lifecycle of the executor/task.

 When we have a model for persistent resources, I can see this fitting
 into the primitives we are proposing here. Since inverse offers work at the
 resource level, we can provide control to the operators to determine
 whether the persistent resources should be reclaimed from the framework as
 part of the maintenance:

 E.g. If decommissioning a machine, the operator can ensure that all
 persistent resources are reclaimed. If rebooting a machine, the operator
 can leave these resources allocated to the framework for when the machine
 is back in the cluster.

 Now, since we have the soft deadlines on inverse offers, a framework like
 HDFS can determine when it can comply to inverse offers based on the global
 data replication state (e.g. always ensure that 2/3 replicas of a block are
 available). If relinquishing a particular data volume would mean that only
 1 copy of a block is available, the framework can wait to comply with the
 inverse offer, or can take steps to create more replicas.

 One interesting question is how the resource expiry time will interact
 with persistent resources, we may want to expose the expiry time at the
 resource level rather than the offer level. Will think about this.

 *However could you specify that when you drain a slave with hard:false
 you don't enter the drained state even when the deadline has passed if
 tasks are still running? This is not explicit in the document and we want
 to make sure operators have the information about this and could avoid
 unfortunate rolling restarts.*


 This is explicit in the document under the soft deadline section: the
 inverse offer will remain outstanding after the soft deadline elapses, we
 won't forcibly drain the task. Anything that's not clear here?




 On Mon, Aug 25, 2014 at 1:08 PM, Maxime Brugidou 
 maxime.brugi...@gmail.com wrote:

 Nice work!

 First question: don't you think that operations should differentiate
 short and long maintenance?
 I am thinking about frameworks that use persistent storage on disk for
 example. A short maintenance such as a slave reboot or upgrade could be
 done without moving the data to another slave. However decommissioning
 requires to drain the storage too.

 If you have an HDFS datanode with 50TB of (replicated) data, you might
 not want to drain it for a reboot (assuming your replication factor is high
 enough) since it takes ages. However for 

Re: MesosCon attendee introduction thread

2014-08-15 Thread Sharma Podila
Hello Everyone,

I work at Netflix. I came across Mesos 7 months ago. I am developing a
Mesos framework/scheduler for cloud-native reactive stream processing.
Together with my colleague, Justin Becker, we are excited to talk about it
at MesosCon.

Previously, I did a fair bit of work on dynamic scheduling of distributed
resources for compute clusters.

Looking forward to meeting everyone at the conference.

Sharma



On Fri, Aug 15, 2014 at 12:22 PM, Victor VIEUX victorvi...@gmail.com
wrote:

 Hi everyone,


 My name is Victor Vieux, I am a French software engineer at docker.com
 based in San Francisco.

 I’m doing a talk about Mesos and Docker for MesosCon, feel free to ping me
 if you have questions related to Docker. I’ll be around on Thursday.


 I’m also doing a talk about Docker at the meetup announced by @ijimene on
 tuesday (http://www.meetup.com/Docker-Chicago/events/196843942/)


 You can find me on twitter: @vieux

 Cheers


 On Fri, Aug 15, 2014 at 12:17 PM, Isabel Jimenez 
 contact.isabeljime...@gmail.com wrote:

 Hi everyone,

 My name is Isabel Jimenez, I am an OPW software engineer intern for
 Apache Mesos this summer. I look forward to meeting all of Mesos
 contributors and talk about all the different Mesos use cases.
 I'll be in Chicago from Monday 18th and be attending and talking at
 Docker's meetup Tuesday 19th.

 Twitter: @ijimene
 Github: https://github.com/jimenez
 Blog: http://blog.isabeljimenez.com/

 See you next week!




 On Thu, Aug 14, 2014 at 9:14 PM, Franklin Angulo Jr feang...@yaipan.com
 wrote:

 Hi all,

 My name is Franklin Angulo and I am an engineering manager at
 Squarespace http://squarespace.com/ in New York City. We've been
 experimenting with Mesos and Docker for a couple of months now to more
 efficiently utilize the resources in the data centers we operate.

 Twitter: @feangulo
 Github: https://github.com/feangulo
 Blog: http://www.franklinangulo.com/blog/

 Looking forward to meeting everyone and attending the talks!


 On Thu, Aug 14, 2014 at 11:51 PM, Charles Baker cnob...@gmail.com
 wrote:

 Hey everyone. Charles Baker here and I'm a developer at SDL
 International's Language Technologies division and am working on our
 next-generation Machine Translation infrastructure. We are planning a
 Docker-on-Mesos deployment strategy and although we haven't got our hands
 dirty with Mesos yet due to other priorities on the backlog, I am super
 stoked about the technology's fit for our use case and am eager to meet
 everyone and hear the war stories and attend the talks!


 -Chuck






 --
 Victor VIEUX
 http://vvieux.com



Re: Exposing executor container

2014-08-12 Thread Sharma Podila
You may already know this, but, this does sound similar to

http://www.mail-archive.com/user@mesos.apache.org/msg00885.html

There was a possible (and partial) solution in using soft limits for memory
for which a ticket was opened.


On Tue, Aug 12, 2014 at 1:17 PM, Thomas Petr tp...@hubspot.com wrote:

 That solution would likely cause us more pain -- we'd still need to figure
 out an appropriate amount of resources to request for artifact downloads /
 extractions, our scheduler would need to be sophisticated enough to only
 accept offers from the same slave that the setup task ran on, and we'd need
 to manage some new shared artifact storage location outside of the
 containers. Is splitting workflows into multiple tasks like this a common
 pattern?

 I personally agree that tasks manually overriding cgroups limits is a
 little sketchy (and am curious how MESOS-1279 would affect this
 discussion), but I doubt that we'll be the last people to attempt something
 like this. In other words, we acknowledge we're going rogue by
 temporarily overriding the limits... are there other implications of
 exposing the container ID that you're worried about?

 Do you have any thoughts about my other idea (overriding the fetcher
 executable for a task)?

 Thanks,
 Tom

 On Tue, Aug 12, 2014 at 2:05 PM, Vinod Kone vinodk...@gmail.com wrote:

 Thanks Thomas for the clarification.

 One solution you could consider would be separating out the setup
 (fetch/extract) phase and running phase into separate mesos tasks. That way
 you can give the setup task resources need for fetching/extracting and as
 soon as it is done, you can send a TASK_FINISHED so that the resources used
 by that task are reclaimed by Mesos. That would give you the dynamism you
 need. Would that work in your scenario?

 Having the executor change cgroup limits behind the scenes, opaquely to
 Mesos, seems like a recipe for problems in the future to me, since it could
 lead to temporary over-commit of resources and affect isolation across
 containers.



 On Tue, Aug 12, 2014 at 10:45 AM, Thomas Petr tp...@hubspot.com wrote:

 Hey Vinod,

 We're not using mesos-fetcher to download the executor -- we ensure our
 executor exists on the slaves beforehand (during machine provisioning, to
 be exact). The issue that Whitney is talking about is OOMing while fetching
 artifacts necessary for task execution (like the JAR for a web service).

 Our own executor
 (https://github.com/HubSpot/Singularity/tree/master/SingularityExecutor) has
 some nice enhancements around S3 downloads and artifact caching that we
 don't necessarily want to lose if we switched back to using mesos-fetcher.

 Surfacing the container ID seems like a trivial change, but another
 alternative could be to allow frameworks to specify an alternative fetcher
 executable (perhaps in CommandInfo?).

 Thanks,
 Tom


 On Tue, Aug 12, 2014 at 1:09 PM, Vinod Kone vinodk...@gmail.com wrote:

 Hi Whitney,

 While we could conceivably set the container id in the environment of
 the executor, I would like to understand the problem you are facing.

 The fetching and extracting of the executor is done in by
 mesos-fetcher, a process forked by slave and run under slave's cgroup.
 AFAICT, this shouldn't cause an OOM in the executor. Does your executor do
 more fetches/extracts once it is launched (e.g., for user's tasks)?







Re: Task serialization per machine?

2014-06-30 Thread Sharma Podila
A likely scenario is that your executor is running the task synchronously
inside the callback to launchTask(). If you make it instead run the task
asynchronously (e.g., in a separate thread), that should resolve it.
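The asynchronous pattern can be sketched roughly as follows (a Python stand-in for the executor callback; `launch_task`/`run_task` are illustrative names, not the actual Mesos executor API): the callback only hands the work to a pool and returns immediately, so the driver thread is free to receive the next launchTask.

```python
import concurrent.futures
import threading
import time

class MyExecutor:
    """Illustrative stand-in for a Mesos executor: launch_task must return
    quickly, so the actual work is dispatched to a worker pool."""
    def __init__(self, max_workers=4):
        self.pool = concurrent.futures.ThreadPoolExecutor(max_workers=max_workers)
        self.results = []
        self.lock = threading.Lock()

    def launch_task(self, task_id):
        # Do NOT run the task body here; just schedule it and return.
        return self.pool.submit(self.run_task, task_id)

    def run_task(self, task_id):
        time.sleep(0.2)  # simulate real work
        with self.lock:
            self.results.append(task_id)

executor = MyExecutor()
start = time.time()
futures = [executor.launch_task(i) for i in range(4)]
concurrent.futures.wait(futures)
elapsed = time.time() - start
# Four 0.2s tasks finish in roughly 0.2s because they run concurrently,
# not the ~0.8s they would take if launch_task ran each one inline.
```

If the task body ran inside the callback instead, each slave would appear to process its tasks strictly one after another, which matches the behavior described above.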


On Mon, Jun 30, 2014 at 12:48 PM, Asim linka...@gmail.com wrote:

 Hi,

 I want to launch multiple tasks on multiple machines (t > m) that can run
 simultaneously. Currently, I find that every machine processes the tasks in
 a serial fashion one after another.

 I have written a framework with a scheduler and a executor. The scheduler
 launches a task list on a bunch of machines (that show up as offers). When
 I send a task list to run with driver->launchTasks(offers[i].id(),
 tasks[i]), I find that every machine picks up one task at a time (and then
 goes to the next). This happens even though the offer can accommodate more
 than one task from this task list easily.

 Is there something that I am missing?

 Thanks,
 Asim




Re: cgroups memory isolation

2014-06-19 Thread Sharma Podila
Purely from a user expectation point of view, I am wondering if such an
abuse (overuse?) of I/O bandwidth/rate should translate into I/O
bandwidth getting throttled for the job instead of it manifesting into an
OOM that results in a job kill. Such I/O overuse translating into memory
overuse seems like an implementation detail (for lack of a better phrase)
of the OS that uses cache'ing. It's not like the job asked for its memory
to be used up for I/O cache'ing :-)

I do see that this isn't Mesos specific, but, rather a containerization
artifact that is inevitable in a shared resource environment.

That said, specifying memory size for jobs is not trivial in a shared
resource environment. Conservative safe margins do help prevent OOMs, but,
they also come with the side effect of fragmenting resources and reducing
utilization. In some cases, they can cause job starvation to some extent,
if most available memory is allocated to the conservative buffering for
every job.
Another approach that could help, if feasible, is to have containers with
elastic boundaries (different from over-subscription) that manage things
such that the sum of actual usage of all containers is <= system resources.
This helps when not all jobs have peak use of resources simultaneously.


On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair tstcl...@redhat.com wrote:

 FWIW -  There is classic grid mantra that applies here.  Test your
 workflow on an upper bound, then over provision to be safe.

 Mesos is no different than SGE, PBS, LSF, Condor, etc.

 Also, there is no hunting algo for jobs, that would have to live outside
 of mesos itself, on some batch system built atop.

 Cheers,
 Tim

 --

 *From: *Thomas Petr tp...@hubspot.com
 *To: *Ian Downes ian.dow...@gmail.com
 *Cc: *user@mesos.apache.org, Eric Abbott eabb...@hubspot.com
 *Sent: *Wednesday, June 18, 2014 9:36:51 AM
 *Subject: *Re: cgroups memory isolation


 Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
 kernel.

 I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
 some weird results. I initially gave the task 256 MB, and it never exceeded
 the memory allocation (I killed the task manually after 5 minutes when the
 file hit 50 GB). Then I noticed your example was 128 MB, so I resized and
 tried again. It exceeded memory
 (https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82) almost
 immediately. The next (replacement) task our framework started ran
 successfully and never exceeded memory. I watched nr_dirty and it
 fluctuated between 1 and 14000 when the task is running. The slave host
 is a c3.xlarge in EC2, if it makes a difference.

 As Mesos users, we'd like an isolation strategy that isn't affected by
 cache this much -- it makes it harder for us to appropriately size things.
 Is it possible through Mesos or cgroups itself to make the page cache not
 count towards the total memory consumption? If the answer is no, do you
 think it'd be worth looking at using Docker for isolation instead?

 -Tom


 On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes ian.dow...@gmail.com wrote:

 Hello Thomas,

 Your impression is mostly correct: the kernel will *try* to reclaim
 memory by writing out dirty pages before killing processes in a cgroup
 but if it's unable to reclaim sufficient pages within some interval (I
 don't recall this off-hand) then it will start killing things.

 We observed this on a 3.4 kernel where we could overwhelm the disk
 subsystem and trigger an oom. Just how quickly this happens depends on
 how fast you're writing compared to how fast your disk subsystem can
 write it out. A simple dd if=/dev/zero of=lotsazeros bs=1M when
 contained in a memory cgroup will fill the cache quickly, reach its
 limit and get oom'ed. We were not able to reproduce this under 3.10
 and 3.11 kernels. Which kernel are you using?

 Example: under 3.4:

 [idownes@hostname tmp]$ cat /proc/self/cgroup
 6:perf_event:/
 4:memory:/test
 3:freezer:/
 2:cpuacct:/
 1:cpu:/
 [idownes@hostname tmp]$ cat
 /sys/fs/cgroup/memory/test/memory.limit_in_bytes  # 128 MB
 134217728
 [idownes@hostname tmp]$ dd if=/dev/zero of=lotsazeros bs=1M
 Killed
 [idownes@hostname tmp]$ ls -lah lotsazeros
 -rw-r--r-- 1 idownes idownes 131M Jun 17 21:55 lotsazeros


 You can also look in /proc/vmstat at nr_dirty to see how many dirty
 pages there are (system wide). If you wrote at a rate sustainable by
 your disk subsystem then you would see a sawtooth pattern _/|_/| ...
 (use something like watch) as the cgroup approached its limit and the
 kernel flushed dirty pages to bring it down.

 This might be an interesting read:

 http://lonesysadmin.net/2013/12/22/better-linux-disk-caching-performance-vm-dirty_ratio/

 Hope this helps! Please do let us know if you're seeing this on a
 kernel >= 3.10, otherwise it's likely this is a kernel issue rather
 than something with Mesos.

 Thanks,
 Ian


 On Tue, Jun 17, 2014 at 2:23 PM, Thomas Petr tp...@hubspot.com wrote:
  Hello,
 
 

Re: cgroups memory isolation

2014-06-19 Thread Sharma Podila
Yeah, having soft-limit for memory seems like the right thing to do
immediately. The only problem left to solve being that it would be nicer to
throttle I/O instead of OOM for high rate I/O jobs. Hopefully the soft
limits on memory push this problem to only the extreme edge cases.

Agreed on still enforcing limits in general. This tends be on an ongoing
issue from the operations perspective, I've had my share of dealing with
it, and I am sure I will continue to do so. Sometimes users can't estimate,
sometimes jobs' memory footprint changes drastically with minor changes,
etc. Memory usage prediction based on historic usage and reactive resizing
based on actual usage are two tools of the trade.

BTW, by resize, did you mean cgroups memory limits can be resized for
running jobs? That's nice to know (am relatively new to cgroups).
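For reference, the hard vs. soft distinction lives in two files of the cgroup v1 memory controller. A minimal sketch, assuming the controller is mounted at /sys/fs/cgroup/memory and a cgroup named "test" (both illustrative; requires root):

```shell
# Hard limit: exceeding it (including unreclaimable page cache) triggers
# the OOM killer for processes in the cgroup.
echo $((128 * 1024 * 1024)) > /sys/fs/cgroup/memory/test/memory.limit_in_bytes

# Soft limit: only enforced under global memory pressure, when the kernel
# preferentially reclaims from cgroups above their soft limit instead of
# OOM-killing them.
echo $((128 * 1024 * 1024)) > /sys/fs/cgroup/memory/test/memory.soft_limit_in_bytes
```

Both files can be rewritten while the cgroup has running processes, which is what makes the resize approach mentioned above possible.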



On Thu, Jun 19, 2014 at 10:55 AM, Tim St Clair tstcl...@redhat.com wrote:

 Awesome response!

 inline below -

 --

 *From: *Sharma Podila spod...@netflix.com
 *To: *user@mesos.apache.org
 *Cc: *Ian Downes ian.dow...@gmail.com, Eric Abbott 
 eabb...@hubspot.com
 *Sent: *Thursday, June 19, 2014 11:54:34 AM

 *Subject: *Re: cgroups memory isolation

 Purely from a user expectation point of view, I am wondering if such an
 abuse (overuse?) of I/O bandwidth/rate should translate into I/O
 bandwidth getting throttled for the job instead of it manifesting into an
 OOM that results in a job kill. Such I/O overuse translating into memory
 overuse seems like an implementation detail (for lack of a better phrase)
 of the OS that uses cache'ing. It's not like the job asked for its memory
 to be used up for I/O cache'ing :-)

 In cgroups, you could optionally specify the memory limit as soft, vs.
 hard (OOM).



 I do see that this isn't Mesos specific, but, rather a containerization
 artifact that is inevitable in a shared resource environment.

 That said, specifying memory size for jobs is not trivial in a shared
 resource environment. Conservative safe margins do help prevent OOMs, but,
 they also come with the side effect of fragmenting resources and reducing
 utilization. In some cases, they can cause job starvation to some extent,
 if most available memory is allocated to the conservative buffering for
 every job.

 Yup, unless you develop tuning models / hunting algorithms.  You need some
 level of global visibility & history.

 Another approach that could help, if feasible, is to have containers with
 elastic boundaries (different from over-subscription) that manage things
 such that the sum of actual usage of all containers is <= system resources.
 This helps when not all jobs have peak use of resources simultaneously.


 You could use soft limits & resize, I like to call it the push-over
 policy.  If the limits are not enforced, what prevents abusive users in
 absence of global visibility?

 IMHO - having soft c-group memory limits being an option seems to be the
 right play given the environment.

 Thoughts?



 On Wed, Jun 18, 2014 at 1:42 PM, Tim St Clair tstcl...@redhat.com wrote:

 FWIW -  There is classic grid mantra that applies here.  Test your
 workflow on an upper bound, then over provision to be safe.

 Mesos is no different than SGE, PBS, LSF, Condor, etc.

 Also, there is no hunting algo for jobs, that would have to live
 outside of mesos itself, on some batch system built atop.

 Cheers,
 Tim

 --

 *From: *Thomas Petr tp...@hubspot.com
 *To: *Ian Downes ian.dow...@gmail.com
 *Cc: *user@mesos.apache.org, Eric Abbott eabb...@hubspot.com
 *Sent: *Wednesday, June 18, 2014 9:36:51 AM
 *Subject: *Re: cgroups memory isolation


 Thanks for all the info, Ian. We're running CentOS 6 with the 2.6.32
 kernel.

 I ran `dd if=/dev/zero of=lotsazeros bs=1M` as a task in Mesos and got
 some weird results. I initially gave the task 256 MB, and it never exceeded
 the memory allocation (I killed the task manually after 5 minutes when the
 file hit 50 GB). Then I noticed your example was 128 MB, so I resized and
 tried again. It exceeded memory
 (https://gist.github.com/tpetr/d4ff2adda1b5b0a21f82) almost
 immediately. The next (replacement) task our framework started ran
 successfully and never exceeded memory. I watched nr_dirty and it
 fluctuated between 1 and 14000 when the task is running. The slave host
 is a c3.xlarge in EC2, if it makes a difference.

 As Mesos users, we'd like an isolation strategy that isn't affected by
 cache this much -- it makes it harder for us to appropriately size things.
 Is it possible through Mesos or cgroups itself to make the page cache not
 count towards the total memory consumption? If the answer is no, do you
 think it'd be worth looking at using Docker for isolation instead?

 -Tom


 On Tue, Jun 17, 2014 at 6:18 PM, Ian Downes ian.dow...@gmail.com wrote:

 Hello Thomas,

 Your impression is mostly correct: the kernel will *try* to reclaim
 memory by writing out dirty pages before killing

Dealing with run away task processes after executor terminates

2014-06-03 Thread Sharma Podila
When a framework executor terminates, Mesos sends TASK_LOST status updates
for tasks that were running. However, if a task had processes that do not
terminate when the executor dies, then we have a problem, since Mesos
considers the slave resources assigned to those tasks as released, whereas
the task processes are still running without releasing those resources.

While it is a good practice for the task processes to exit when their
executor dies, I am not sure that can be guaranteed. I am wondering how
others are dealing with such illegal processes - that is, processes that
once belonged to Mesos-run tasks but no longer do.

Conceivably, a per-slave reaper/GC process can periodically scan the
slave's process tree to ensure all processes are 'legal'. Assuming that
such a reaper exists (and could be tricky in a multi-framework environment)
on the slave and is not risky in killing illegal processes, there is still
the time window left until the reaper completes its next clean up routine.
In the mean time, new tasks can land and fail trying to use a resource that
was assumed to be free by Mesos. Especially problematic for ports. Not as
much for CPU and memory.

Would love to hear thoughts on how you are handling this scenario.
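A toy sketch of the bookkeeping such a reaper would do (a real implementation would walk /proc and the slave's cgroup hierarchy; all names here are illustrative, not an actual Mesos component):

```python
def find_runaway_pids(pids_in_task_cgroups, pids_holding_task_resources):
    """Compare the processes actually holding task resources (e.g., bound to
    Mesos-allocated ports) against the processes still tracked inside live
    task containers; anything in the former but not the latter is a runaway
    candidate left behind by a dead executor."""
    return sorted(set(pids_holding_task_resources) - set(pids_in_task_cgroups))

# Mesos tracks tasks with pids 100 and 200; pid 300 still holds a task port
# even though its executor already exited.
runaways = find_runaway_pids({100, 200}, {100, 200, 300})
```

Even with such a reaper, the window between executor death and the next sweep remains, which is the race on ports described above.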


Re: Dealing with run away task processes after executor terminates

2014-06-03 Thread Sharma Podila
No, I haven't talked to either of them. Would be great to hear their
thoughts on this. Thanks for including them.

Is container cleanup specific to cgroups? Or, would other containers, say
Docker, also have similar clean up behavior?


On Tue, Jun 3, 2014 at 5:44 PM, Vinod Kone vinodk...@gmail.com wrote:

 +Jie,Ian

 Not sure if you've talked to Ian Downes and/or Jie Yu regarding this but
 they were discussing the same issue (offline) today.

 Just to be sure, if you are using cgroups, the mesos slave will cleanup
 the container (and all its processes) when an executor exits. Now there is
 definitely a race here, mesos might release the resource to framework
 before the container is destroyed. We'll try to fix that really soon. I'll
 let Jie/Ian chime in regarding fixes/tickets.


 On Tue, Jun 3, 2014 at 4:25 PM, Sharma Podila spod...@netflix.com wrote:

 When a framework executor terminates, Mesos sends TASK_LOST status
 updates for tasks that were running. However, if a task had processes that
 do not terminate when the executor dies, then we have a problem since Mesos
 considers the slave resources assigned to those tasks as released, whereas
 the task processes are still running without releasing those resources.

 While it is a good practice for the task processes to exit when their
 executor dies, I am not sure that can be guaranteed. I am wondering how
 others are dealing with such illegal processes - that is, processes that
 once belonged to Mesos-run tasks but no longer do.

 Conceivably, a per-slave reaper/GC process can periodically scan the
 slave's process tree to ensure all processes are 'legal'. Assuming that
 such a reaper exists (and could be tricky in a multi-framework environment)
 on the slave and is not risky in killing illegal processes, there is still
 the time window left until the reaper completes its next clean up routine.
 In the mean time, new tasks can land and fail trying to use a resource that
 was assumed to be free by Mesos. Especially problematic for ports. Not as
 much for CPU and memory.

 Would love to hear thoughts on how you are handling this scenario.





Q on master state.json

2014-05-21 Thread Sharma Podila
I see that master/state.json has state information on frameworks, where in,
it has a list of all completed_tasks. Each task seems to be about 500
bytes.

Does the master have a list of all completed tasks for the framework?
Thinking naively about it, does it mean that if I were to run, say, 100K
tasks a day, we have 50MBytes of data in there? In 3 weeks that's a GB.
Which by itself maybe OK, but not if it grows indefinitely.

Is that a cause for concern? Or, is that an incorrect extrapolation? Is
there some kind of purging that happens on tasks that completed a while ago?
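The extrapolation, assuming roughly 500 bytes per completed task as observed in state.json (both figures are the estimates from this message, not measured constants):

```python
BYTES_PER_COMPLETED_TASK = 500      # rough per-task size seen in state.json
TASKS_PER_DAY = 100_000

bytes_per_day = TASKS_PER_DAY * BYTES_PER_COMPLETED_TASK
mb_per_day = bytes_per_day / 1_000_000                    # ~50 MB per day
gb_after_three_weeks = 21 * bytes_per_day / 1_000_000_000 # ~1 GB in 3 weeks
```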

Sharma


Re: Question on resource offers and framework failover

2014-05-16 Thread Sharma Podila


   (1) If the slave is unknown, we send TASK_LOST.
   (2) If the task is missing on the slave, we send TASK_LOST.
   (3) If the task state differs, we send the latest state.
 In the absence of bugs or data loss, (1) is the only one that is strictly
 necessary for correctness. In your case, (1) or (2) would result in the
 master sending back TASK_LOST since the task (or possibly the slave) is no
 longer present from the Master's perspective.
 My understanding is that the call from my framework to
  SchedulerDriver.reconcileTasks() ends up in Mesos' master.cpp
 reconcileTasks() at line 2142 (Mesos 0.18). In there there is no else case
 for if(slave != NULL). There is never a TASK_LOST sent from there. Only
 task updates for tasks and slaves Mesos knows currently that have different
 status. I have seen this happen in my testing - if I were to ask
 reconciliation for a task that doesn't exist, there is no answer. The
 doesn't exist case occurs when my framework were to loose any previously
 sent terminal status update for the task.
 Is it possible that your description above refers to the reconciliation
  that happens when a framework registers with Mesos? Whereas my question is
 about when an already registered framework calls for reconciliation. Or, is
 the code location I refer to above incorrect?

Reconciliation currently occurs in three cases, from the master's
 perspective:


I think there's a bit more to this. I believe I need to do the above only
if I (the framework) am registering with Mesos with failover option. And if
the failover timeout has already happened. Currently the registration
succeeds with no indication that the failover didn't occur (there was a
mention in another thread that the registration will fail in future with
the reuse of FrameworkID upon failover timeout). If so, I have no way of
knowing that my previous tasks have been killed. Once the registration
starts failing with failover option if failover timeout has passed, then, I
think the reconciliation strategy will work fine.
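Putting the three quoted rules together, the intended master-side behavior can be modeled as follows (a toy sketch, not the actual master.cpp logic; per this thread, the 0.18 implementation stays silent in the first two cases instead of replying TASK_LOST):

```python
def reconcile(known_slaves, master_tasks, slave_id, task_id, framework_state):
    """Toy model of the master-side reconciliation rules quoted above.
    master_tasks maps (slave_id, task_id) -> the master's view of the state;
    returns the status update the master would send to the framework."""
    if slave_id not in known_slaves:
        return "TASK_LOST"                     # (1) slave is unknown
    master_state = master_tasks.get((slave_id, task_id))
    if master_state is None:
        return "TASK_LOST"                     # (2) task missing on the slave
    if master_state != framework_state:
        return master_state                    # (3) send the latest state
    return master_state                        # states already agree

known = {"slave-1"}
tasks = {("slave-1", "task-1"): "TASK_FINISHED"}
```

A framework that persists its intent before launching, then runs its known tasks through this kind of reconciliation on (re)registration, can converge its state with the master's, except for the data-loss case discussed below.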




On Fri, May 16, 2014 at 1:16 PM, Sharma Podila spod...@netflix.com wrote:

 I'm not sure these two cases are any different. The TASK_INVALID_OFFER
 would model a terminal state for the task. Afterwards, one still has to
 generate a new TaskInfo in so far as the TaskID should not be re-used
 across launch requests.

 I was expecting to reuse the TaskID. If it can't be reused, then, agreed,
 I do not see a difference in the two cases then.

 Reconciliation currently occurs in three cases, from the master's
 perspective:
   (1) If the slave is unknown, we send TASK_LOST.
   (2) If the task is missing on the slave, we send TASK_LOST.
   (3) If the task state differs, we send the latest state.
 In the absence of bugs or data loss, (1) is the only one that is strictly
 necessary for correctness. In your case, (1) or (2) would result in the
 master sending back TASK_LOST since the task (or possibly the slave) is no
 longer present from the Master's perspective.


 My understanding is that the call from my framework to
 ​SchedulerDriver.reconcileTasks() ends up in Mesos' master.cpp
 reconcileTasks() at line 2142 (Mesos 0.18). In there there is no else case
 for if(slave != NULL). There is never a TASK_LOST sent from there. Only
 task updates for tasks and slaves Mesos knows currently that have different
 status. I have seen this happen in my testing - if I were to ask
 reconciliation for a task that doesn't exist, there is no answer. The
  'doesn't exist' case occurs when my framework were to lose any previously
 sent terminal status update for the task.

 Is it possible that your description above refers to the reconciliation
  that happens when a framework registers with Mesos? Whereas my question is
 about when an already registered framework calls for reconciliation. Or, is
 the code location I refer to above incorrect?

 I do agree that it would be nice if we provided a mechanism to reconcile
 these scenarios as well, given that bugs can and will occur! What are the
 operational causes you were referring to? I've filed 
  MESOS-1379 (https://issues.apache.org/jira/browse/MESOS-1379) for
 this.
 Barring bugs or data loss, if the framework persists its intent before
 launching a task, then the set of tasks in the framework will always be a
 superset of the tasks in the Master/Slaves.


 Thanks for filing that. Data loss is what I had in mind. Say a framework
 restarts after its persistence store crashes hard and had to be rebuilt
 from a backup/replica which maybe a bit behind. It would then be unaware of
 some of the newer tasks.




 On Thu, May 15, 2014 at 4:28 PM, Benjamin Mahler 
 benjamin.mah...@gmail.com wrote:

 Thanks for providing more details!

 I'm not sure these two cases are any different. The TASK_INVALID_OFFER
 would model a terminal state for the task. Afterwards, one still has to
 generate a new TaskInfo in so far as the TaskID should not be re-used
 across launch requests.

 *For example

Re: Question on resource offers and framework failover

2014-05-16 Thread Sharma Podila

 I'm not sure these two cases are any different. The TASK_INVALID_OFFER
 would model a terminal state for the task. Afterwards, one still has to
 generate a new TaskInfo in so far as the TaskID should not be re-used
 across launch requests.

I was expecting to reuse the TaskID. If it can't be reused, then, agreed, I
do not see a difference in the two cases then.

Reconciliation currently occurs in three cases, from the master's
 perspective:
   (1) If the slave is unknown, we send TASK_LOST.
   (2) If the task is missing on the slave, we send TASK_LOST.
   (3) If the task state differs, we send the latest state.
 In the absence of bugs or data loss, (1) is the only one that is strictly
 necessary for correctness. In your case, (1) or (2) would result in the
 master sending back TASK_LOST since the task (or possibly the slave) is no
 longer present from the Master's perspective.


My understanding is that the call from my framework to
SchedulerDriver.reconcileTasks() ends up in Mesos' master.cpp
reconcileTasks() at line 2142 (Mesos 0.18). In there there is no else case
for if(slave != NULL). There is never a TASK_LOST sent from there. Only
task updates for tasks and slaves Mesos knows currently that have different
status. I have seen this happen in my testing - if I were to ask
reconciliation for a task that doesn't exist, there is no answer. The
'doesn't exist' case occurs when my framework were to lose any previously
sent terminal status update for the task.

Is it possible that your description above refers to the reconciliation
that happens when a framework registers with Mesos? Whereas my question is
about when an already registered framework calls for reconciliation. Or, is
the code location I refer to above incorrect?

I do agree that it would be nice if we provided a mechanism to reconcile
 these scenarios as well, given that bugs can and will occur! What are the
 operational causes you were referring to? I've filed 
 MESOS-1379 (https://issues.apache.org/jira/browse/MESOS-1379) for
 this.
 Barring bugs or data loss, if the framework persists its intent before
 launching a task, then the set of tasks in the framework will always be a
 superset of the tasks in the Master/Slaves.


Thanks for filing that. Data loss is what I had in mind. Say a framework
restarts after its persistence store crashes hard and had to be rebuilt
from a backup/replica which maybe a bit behind. It would then be unaware of
some of the newer tasks.




On Thu, May 15, 2014 at 4:28 PM, Benjamin Mahler
benjamin.mah...@gmail.com wrote:

 Thanks for providing more details!

 I'm not sure these two cases are any different. The TASK_INVALID_OFFER
 would model a terminal state for the task. Afterwards, one still has to
 generate a new TaskInfo in so far as the TaskID should not be re-used
 across launch requests.

 *For example, what if reconciliation is requested on a task that completed
 a long time ago? For which Mesos may have already sent a status of
 completion and/or lost, but my framework somehow lost that. Hopefully, this
 and other possible cases are addressed. *


 Reconciliation currently occurs in three cases, from the master's
 perspective:
   (1) If the slave is unknown, we send TASK_LOST.
   (2) If the task is missing on the slave, we send TASK_LOST.
   (3) If the task state differs, we send the latest state.

 In the absence of bugs or data loss, (1) is the only one that is strictly
 necessary for correctness. In your case, (1) or (2) would result in the
 master sending back TASK_LOST since the task (or possibly the slave) is no
 longer present from the Master's perspective.

 *What about tasks that Mesos is running for my framework, but my framework
 lost track of them (there could be some operational causes for this, even
 if we assume my code is bug free)? How are frameworks handling such a
 scenario?*


 I do agree that it would be nice if we provided a mechanism to reconcile
 these scenarios as well, given that bugs can and will occur! What are the
 operational causes you were referring to? I've filed 
  MESOS-1379 (https://issues.apache.org/jira/browse/MESOS-1379) for
 this.

 Barring bugs or data loss, if the framework persists its intent before
 launching a task, then the set of tasks in the framework will always be a
 superset of the tasks in the Master/Slaves.

 On Wed, May 14, 2014 at 11:04 PM, Sharma Podila spod...@netflix.com wrote:

 TASK_LOST is a good thing. I expect to deal with it now and in the
 future. I was trying to distinguish this:

- case TASK_LOST:
   - persist state update to TASK_LOST
   - create new task submission request
   - schedule with next available offer
- case TASK_INVALID_OFFER:
   - persist state update to PENDING (i.e., from Launched back to
   Pending)
   - schedule with next available offer

 The difference is create new task submission request. Although this
 would be undesirable, and an additional call into persistence state, I can
 see

Re: Question on resource offers and framework failover

2014-05-15 Thread Sharma Podila
TASK_LOST is a good thing. I expect to deal with it now and in the future.
I was trying to distinguish this:

   - case TASK_LOST:
  - persist state update to TASK_LOST
  - create new task submission request
  - schedule with next available offer
   - case TASK_INVALID_OFFER:
  - persist state update to PENDING (i.e., from Launched back to
  Pending)
  - schedule with next available offer

The difference is create new task submission request. Although this would
be undesirable, and an additional call into persistence state, I can see
that this is an unlikely event. In which case, introducing complexity to
differentiate the two cases may not be a critical need. As I was saying,
strictly speaking there's a difference.
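The two handling paths can be sketched as follows (TASK_INVALID_OFFER is the hypothetical status being discussed here, not an actual Mesos state, and the action names are illustrative):

```python
def handle_update(task, status):
    """Contrast the two handling paths: TASK_LOST forces a brand-new task
    submission (a TaskID must not be reused across launches), while a
    hypothetical TASK_INVALID_OFFER would only roll the task back to PENDING
    before retrying with the next offer."""
    actions = []
    if status == "TASK_LOST":
        actions.append(("persist", "TASK_LOST"))
        actions.append(("create_new_submission", task))  # new TaskID required
    elif status == "TASK_INVALID_OFFER":
        actions.append(("persist", "PENDING"))           # Launched -> Pending
    actions.append(("schedule_with_next_offer", task))
    return actions
```

As noted above, once the TaskID cannot be reused anyway, the two paths collapse into essentially the same work, which is the argument against adding the extra status.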

are you still planning to do this out-of-band reconciliation when Mesos
 provides complete reconciliation (thanks to the Registrar)? Mesos will
 ensure that the situation you describe is not possible (in 0.19.0
 optionally, and in 0.20.0 by default).


It would be nice to not have to do it. Depends on what complete
reconciliation entails. For example, what if reconciliation is requested on
a task that completed a long time ago? For which Mesos may have already
sent a status of completion and/or lost, but my framework somehow lost
that. Hopefully, this and other possible cases are addressed.

This brings up another question: Reconciliation addresses tasks that my
framework knows about. What about tasks that Mesos is running for my
framework, but my framework lost track of them (there could be some
operational causes for this, even if we assume my code is bug free)? How
are frameworks handling such a scenario?



On Wed, May 14, 2014 at 4:05 PM, Benjamin Mahler
benjamin.mah...@gmail.com wrote:

  Whereas a TASK_LOST will make me (unnecessarily, in this case) try to
 ensure that the task is actually lost, not running away on the slave that
 got disconnected from Mesos master. Not all environments may need the
 distinction, but at least some do.


 To be clear, are you still planning to do this out-of-band reconciliation
 when Mesos provides complete reconciliation (thanks to the Registrar)?
 Mesos will ensure that the situation you describe is not possible (in
 0.19.0 optionally, and in 0.20.0 by default).

 Taking a step back, you will always have to deal with TASK_LOST as a
 status *regardless* of what the true status of the task was, this is the
 reality of failures in a distributed system. For example, let's say the
 Master fails right before we could send you the TASK_INVALID_OFFER update,
 or your framework fails right before it could persist the
 TASK_INVALID_OFFER update. In both cases, you will need to reconcile with
 the Master, and it will be TASK_LOST.

 Likewise, let's say your task reached TASK_FINISHED on the slave, but the slave fails
 permanently before the update could reach the Master. Then when you
 reconcile this with the Master it will be TASK_LOST.

 For these reasons, we haven't yet found much value in providing more
 precise task states for various conditions.


 On Tue, May 13, 2014 at 10:10 AM, Sharma Podila <spod...@netflix.com> wrote:

 Thanks for confirming that, Adam.

 , but it would be a good Mesos FAQ topic.

 I was thinking it might be good to also add to doc in code, either in
 mesos.proto or MesosSchedulerDriver (mesos.proto already refers to the
 latter for failover at FrameworkID definition).

 If you were to try to persist the 'ephemeral' offers to another framework
 instance, and call launchTasks with one of the old offers, the master will
 respond with TASK_LOST (Task launched with invalid offers), since the
 master no longer knows about that offer

 Strictly speaking, shouldn't this produce some kind of an 'invalid offer'
 response instead of task being lost? A TASK_LOST response is handled
 differently in my scheduler, for example, compared to what I'd do for an
 invalid offer response. An invalid offer would just be a simple discard
 offer and retry of launch with a more recent offer. Whereas a TASK_LOST
 will make me (unnecessarily, in this case) try to ensure that the task is
 actually lost, not running away on the slave that got disconnected from
 Mesos master. Not all environments may need the distinction, but at least
 some do.



 On Mon, May 12, 2014 at 11:12 PM, Adam Bordelon <a...@mesosphere.io> wrote:

 Correct, Sharma. I don't think this is documented anywhere yet, but it
 would be a good Mesos FAQ topic.
 When the master notices that the framework has exited or is deactivated,
 it disables the framework in the allocator so no new offers will be made to
 that framework, and removes any outstanding offers (but does not send a
 RescindResourceOfferMessage to the framework, since the framework is
 presumably failing over). When a framework reregisters, it is reactivated
 in the allocator and will start receiving new offers again.
 If you were to try to persist the 'ephemeral' offers to another
 framework instance, and call

Re: Question on resource offers and framework failover

2014-05-13 Thread Sharma Podila
Thanks for confirming that, Adam.

 , but it would be a good Mesos FAQ topic.

I was thinking it might be good to also add to doc in code, either in
mesos.proto or MesosSchedulerDriver (mesos.proto already refers to the
latter for failover at FrameworkID definition).

If you were to try to persist the 'ephemeral' offers to another framework
 instance, and call launchTasks with one of the old offers, the master will
 respond with TASK_LOST (Task launched with invalid offers), since the
 master no longer knows about that offer

Strictly speaking, shouldn't this produce some kind of an 'invalid offer'
response instead of task being lost? A TASK_LOST response is handled
differently in my scheduler, for example, compared to what I'd do for an
invalid offer response. An invalid offer would just be a simple discard
offer and retry of launch with a more recent offer. Whereas a TASK_LOST
will make me (unnecessarily, in this case) try to ensure that the task is
actually lost, not running away on the slave that got disconnected from
Mesos master. Not all environments may need the distinction, but at least
some do.
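Until the master grows a distinct status for this case, one workaround is to key off the human-readable message that accompanies the update (Adam's reply below confirms the master sends "Task launched with invalid offers"). A minimal sketch of that classification, with the caveat that the message text is not a stable API, so this is brittle by design; the class and enum names are illustrative:

```java
// Sketch: distinguish the two TASK_LOST cases by the status message.
public class TaskLostClassifier {
    public enum Action { RETRY_WITH_FRESH_OFFER, VERIFY_TASK_REALLY_LOST }

    public static Action classify(String statusMessage) {
        if (statusMessage != null && statusMessage.contains("invalid offers")) {
            // The launch never happened; discard the stale offer and retry.
            return Action.RETRY_WITH_FRESH_OFFER;
        }
        // Anything else: the task may still be running away somewhere,
        // so trigger out-of-band verification.
        return Action.VERIFY_TASK_REALLY_LOST;
    }
}
```

A scheduler's statusUpdate() callback could call classify(status.getMessage()) and skip the expensive verification path for the invalid-offer case.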



On Mon, May 12, 2014 at 11:12 PM, Adam Bordelon <a...@mesosphere.io> wrote:

 Correct, Sharma. I don't think this is documented anywhere yet, but it
 would be a good Mesos FAQ topic.
 When the master notices that the framework has exited or is deactivated,
 it disables the framework in the allocator so no new offers will be made to
 that framework, and removes any outstanding offers (but does not send a
 RescindResourceOfferMessage to the framework, since the framework is
 presumably failing over). When a framework reregisters, it is reactivated
 in the allocator and will start receiving new offers again.
 If you were to try to persist the 'ephemeral' offers to another framework
 instance, and call launchTasks with one of the old offers, the master will
 respond with TASK_LOST (Task launched with invalid offers), since the
 master no longer knows about that offer. So don't bother trying. :)
 Already running tasks (used offers) continue running, unless the framework
 failover timeout is exceeded.


 On Mon, May 12, 2014 at 5:38 PM, Sharma Podila <spod...@netflix.com> wrote:

 My understanding is that when a framework fails over (either new instance
 starts after previous one fails, or the same instance restarts), Mesos
 master would automatically cancel any unused offers it had given to the
 previous framework instance. This is a good thing. Can someone confirm this
 to be the case? Is such an expectation documented somewhere? I did look at
 master.cpp and I hope I interpreted it right.

 Effectively then, the offers are 'ephemeral' and don't need to be
 persisted by the framework scheduler to pass along to another of its
 instances that may fail over as the leader.

 Thank you.

 Sharma





Trying to get task reconciliation to work

2014-04-17 Thread Sharma Podila
Hello,

I don't seem to have reconcileTasks() working for me and was wondering if I
am either using it incorrectly or hitting a problem. Here's what's
happening:

1. There's one Mesos (0.18) master, one slave, one framework, all running
on Ubuntu 12.04
2. Mesos master and slave come up fine (using Zookeeper, but that isn't
relevant here, I'd think)
3. My framework registers and gets offers
4. Two tasks are launched, both start running fine on the single available
slave
5. I restart my framework. During restart my framework knows that it had
previously launched two tasks that were last known to be in running state.
Therefore, upon getting the registered() callback, it calls
driver.reconcileTasks() for the two tasks. In actuality, the tasks are
still running fine. I see this in mesos master logs:

I0417 12:26:27.207361 27301 master.cpp:2154] Performing task state
reconciliation for framework MyFramework

But no other logs about reconciliation.

6. My framework gets no callback about status of tasks that it requested
reconciliation on.

At this point, I am not sure if the lack of a callback for status update is
due to
  a) the fact that my framework asked for reconciliation on running state,
which Mesos also knows to be true, therefore, no status update
  b) Or, if the reconcile is not working. (hopefully this; reason (a) would
be problematic)

So, I then proceed to another test:

7. kill my framework and mesos master
8. Then, kill the slave (as an aside, this seems to have killed the tasks
as well)
9. Restart mesos master
10. Restart my framework. Now, again the reconciliation is requested.
11. Still no callback.

At this time, mesos master doesn't know about the slave because it hasn't
returned since master restarted.
What is the expected behavior for reconciliation under these circumstances?

12. Restarted slave
13. Killed and restarted my framework.
14. Still no callback for reconciliation.

Given these results, I can't see how reconciliation is working at all. I
did try this with Mesos 0.16 first and then upgraded to 0.18 to see if it
makes a difference.

Thank you for any ideas on getting this resolved.

Sharma
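For reference, the selection logic behind step 5 above can be sketched as follows: on registered(), the scheduler walks its persisted view of last-known task states and reconciles every task that was last seen in a non-terminal state. Only the bookkeeping is shown; the actual Mesos call would be driver.reconcileTasks() with TaskStatus protobufs built from the returned IDs. ReconcileHelper and the plain state strings are illustrative stand-ins:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Sketch: pick the tasks worth reconciling after a framework restart.
public class ReconcileHelper {
    private static final Set<String> TERMINAL = new HashSet<>(
        Arrays.asList("TASK_FINISHED", "TASK_FAILED",
                      "TASK_KILLED", "TASK_LOST"));

    /** Given lastKnown taskId -> state, return the task IDs to reconcile. */
    public static List<String> tasksToReconcile(Map<String, String> lastKnown) {
        List<String> ids = new ArrayList<>();
        for (Map.Entry<String, String> e : lastKnown.entrySet()) {
            if (!TERMINAL.contains(e.getValue())) {
                ids.add(e.getKey());
            }
        }
        Collections.sort(ids); // deterministic order for logging/tests
        return ids;
    }
}
```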


Re: Trying to get task reconciliation to work

2014-04-17 Thread Sharma Podila
Should've looked at the code before sending the previous email...
master/main.cpp confirmed what I needed to know. It doesn't look like I
will be able to use reconcileTasks the way I thought I could. Effectively,
a lack of callback could either mean that the master agrees with the
requested reconcile task state, or that the task and/or slave is currently
unknown, which makes it an unreliable source of data. I understand this is
expected to improve later by leveraging the registrar, but, I suspect
there's more to it.

I take it then that individual frameworks need to have their own mechanisms
to ascertain the state of their tasks.
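One such mechanism: since a reconciliation request may produce no callback at all for unknown tasks (as observed above), the scheduler can record when it asked and treat any task still unconfirmed after a deadline as lost, then reschedule it. A minimal self-contained sketch under that assumption; all names are illustrative:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch: reconciliation with a timeout. Tasks that receive no status
// update within the deadline are presumed lost and can be rescheduled.
public class ReconcileWithTimeout {
    private final Map<String, Long> pendingSinceMillis = new HashMap<>();

    /** Record that reconciliation was requested for this task. */
    public void requested(String taskId, long nowMillis) {
        pendingSinceMillis.put(taskId, nowMillis);
    }

    /** A status update arrived; the task is confirmed, stop tracking it. */
    public void confirmed(String taskId) {
        pendingSinceMillis.remove(taskId);
    }

    /** Tasks with no status update within timeoutMillis of the request. */
    public List<String> presumedLost(long nowMillis, long timeoutMillis) {
        List<String> lost = new ArrayList<>();
        for (Map.Entry<String, Long> e : pendingSinceMillis.entrySet()) {
            if (nowMillis - e.getValue() >= timeoutMillis) {
                lost.add(e.getKey());
            }
        }
        Collections.sort(lost);
        return lost;
    }
}
```

The timeout needs to comfortably exceed the master's update latency under load, otherwise slow-but-alive tasks get double-scheduled.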


On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila <spod...@netflix.com> wrote:

 Hello,

 I don't seem to have reconcileTasks() working for me and was wondering if
 I am either using it incorrectly or hitting a problem. Here's what's
 happening:

 1. There's one Mesos (0.18) master, one slave, one framework, all running
 on Ubuntu 12.04
 2. Mesos master and slave come up fine (using Zookeeper, but that isn't
 relevant here, I'd think)
 3. My framework registers and gets offers
 4. Two tasks are launched, both start running fine on the single available
 slave
 5. I restart my framework. During restart my framework knows that it had
 previously launched two tasks that were last known to be in running state.
 Therefore, upon getting the registered() callback, it calls
 driver.reconcileTasks() for the two tasks. In actuality, the tasks are
 still running fine. I see this in mesos master logs:

 I0417 12:26:27.207361 27301 master.cpp:2154] Performing task state
 reconciliation for framework MyFramework

 But no other logs about reconciliation.

 6. My framework gets no callback about status of tasks that it requested
 reconciliation on.

 At this point, I am not sure if the lack of a callback for status update
 is due to
   a) the fact that my framework asked for reconciliation on running state,
 which Mesos also knows to be true, therefore, no status update
   b) Or, if the reconcile is not working. (hopefully this; reason (a)
 would be problematic)

 So, I then proceed to another test:

 7. kill my framework and mesos master
 8. Then, kill the slave (as an aside, this seems to have killed the tasks
 as well)
 9. Restart mesos master
 10. Restart my framework. Now, again the reconciliation is requested.
 11. Still no callback.

 At this time, mesos master doesn't know about the slave because it hasn't
 returned since master restarted.
 What is the expected behavior for reconciliation under these circumstances?

 12. Restarted slave
 13. Killed and restarted my framework.
 14. Still no callback for reconciliation.

 Given these results, I can't see how reconciliation is working at all. I
 did try this with Mesos 0.16 first and then upgraded to 0.18 to see if it
 makes a difference.

 Thank you for any ideas on getting this resolved.

 Sharma




Re: Question on executors

2014-03-10 Thread Sharma Podila
Thank you for the confirmation and the pointer to the 1 sec sleep.
Yes, I meant TASK_FINISHED.


 If you don't want to implement an Executor and your Task merely consists
 of forking an arbitrary process, you can use the built-in Command
 Executor. You can launch a task directly in this manner by specifying a
 CommandInfo inside your TaskInfo (see the documentation in mesos.proto).
 Unless you're using the Command Executor, you will still need to
 implement forking and process management.


I do have an Executor implementation in Java to handle all the callbacks
from the driver. The launchTask implementation simply loads the task's jar
and runs the code in the same JVM. In this case, it sounds like I don't
need to implement forking and process management. Unless there's something
else I am missing?
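The in-JVM launchTask pattern described above can be sketched as below. In a real executor the status transitions would be driver.sendStatusUpdate() calls with Mesos TaskStatus protobufs; here a plain callback interface stands in for the driver so the flow (RUNNING, then run, then FINISHED or FAILED) is visible. All names are illustrative:

```java
// Sketch of an executor that runs tasks in-process instead of forking.
public class InJvmTaskRunner {
    /** Stand-in for ExecutorDriver.sendStatusUpdate(). */
    public interface StatusSink {
        void send(String taskId, String state);
    }

    public static void launchTask(String taskId, Runnable work, StatusSink sink) {
        sink.send(taskId, "TASK_RUNNING");
        try {
            // In the real executor: load the task's jar and invoke it.
            work.run();
            sink.send(taskId, "TASK_FINISHED");
        } catch (RuntimeException e) {
            sink.send(taskId, "TASK_FAILED");
        }
    }
}
```

One caveat with this approach: the real launchTask() callback should not block the driver thread, so the work above would normally be handed to a separate thread or executor service rather than run inline.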