Re: Users, or Job Types Base Resource Allocation

2017-06-13 Thread Sharma Podila
Mesos does have roles based allocation that may cover this need. Alternatively, schedulers can do this from within the resources allocated to them. As an example, OSS Fenzo library helps with the latter. Do you mean one of these? Something else? On Tue, Jun 13, 2017 at 1:48 AM, Bryan Fok

Re: Can I consider other framework tasks as a resource? Does it make sense?

2016-12-15 Thread Sharma Podila
> when I need to model some constraint how to place a task I would know where > it belongs in my framework’s code. It seems to be answered. Thanks a lot. > > > > *From:* Sharma Podila [mailto:spod...@netflix.com] > *Sent:* 15 December 2016 1:59 > *To:* user@mesos.apache

Re: Can I consider other framework tasks as a resource? Does it make sense?

2016-12-14 Thread Sharma Podila
In general, placing a task based on certain constraints (e.g., locality with other tasks) is a scheduling concern. The complexity in your scenario is that the constraint specification requires knowledge external to your scheduler. If you are able to route that external information (on what and

Re: mesos agent not recovering after ZK init failure

2016-07-15 Thread Sharma Podila
t works more reliably with ZFS or at least calling > out the need to specify the disk resource explicitly. > > Thanks for the help. > Andrew > > On Jul 15, 2016, at 11:41 AM, Jie Yu <yujie@gmail.com> wrote: > > Can you hard code your disk size using --resources

Re: mesos agent not recovering after ZK init failure

2016-07-15 Thread Sharma Podila
"whole GBs" of the disk so we are insensitive to small changes in the total size. But, not sure if the changes can be larger due to Andrew's point above. On Mon, Mar 7, 2016 at 6:00 PM, Sharma Podila <spod...@netflix.com> wrote: > Sure, will do. > > > On Mon, Mar 7, 2016

Re: Mesos on hybrid AWS - Best practices?

2016-06-30 Thread Sharma Podila
I would second the suggestion of separate Mesos clusters for DC and AWS, with a layer on top for picking one or either based on the job SLAs and resource requirements. The local storage on cloud instances is more ephemeral than I'd expect the DC instances to be. So, persistent storage of job

Re: how to stop the mesos executor process in JVM?

2016-06-06 Thread Sharma Podila
Yao, in our Java executor, we explicitly call System.exit(0) after we have successfully sent the last finished message. However, note that there can be a little bit of a timing issue here. Once we send the last message, we call an asynchronous "sleep some and exit" routine. This gives the mesos

Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-05-13 Thread Sharma Podila
iPhone > On May 13, 2016, at 2:10 AM, Tomek Janiszewski <jani...@gmail.com> wrote: > > Cool. Did you hit any troubles with that setup? > > > On Fri, 13.05.2016, 03:13, Sharma Podila <spod...@netflix.com> wrote: >> We have Mesos agents running on Pi

Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-05-12 Thread Sharma Podila
We have Mesos agents running on Pi3's taking tasks from master running on a Linux laptop. https://twitter.com/aspyker/status/730924571440779264 More info to follow. Thanks for all the pointers. On Fri, Apr 29, 2016 at 1:09 PM, Sharma Podila <spod...@netflix.com> wrote: > Fy

Re: How to use a complete host

2016-05-02 Thread Sharma Podila
This can't be achieved with the offer model as it stands today, unless you have only a single framework in the cluster. There is no visibility into what other resources are available on the agent which weren't offered to your framework. However, for the short term, you can use a hack to put in

Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-25 Thread Sharma Podila
-- > > Aaron Carey > Production Engineer - Cloud Pipeline > Industrial Light & Magic > London > 020 3751 9150 > > ------ > *From:* Sharma Podila [spod...@netflix.com] > *Sent:* 22 April 2016 17:53 > *To:* user@mesos.apache.org; dev > *S

Re: Running Mesos agent on ARM (Raspberry Pi)?

2016-04-22 Thread Sharma Podila
:02, haosdent <haosd...@gmail.com> wrote: >> >>> Tomek has a GSoC proposal to make Mesos build on ARM >>> https://docs.google.com/document/d/1zbms2jQfExuIm6g-adqaXjFpPif6OsqJ84KAgMrOjHQ/edit >>> I think you could take a look at this code in GitHub >

Running Mesos agent on ARM (Raspberry Pi)?

2016-04-22 Thread Sharma Podila
We are working on a hack to run Mesos agents on Raspberry Pi and are wondering if anyone here has done that before. From the Google search results we looked at so far, it seems like it has been compiled, but we haven't seen an indication that anyone has run it and launched tasks on them. And does

Re: mesos agent not recovering after ZK init failure

2016-03-07 Thread Sharma Podila
n. > > On Fri, Feb 26, 2016 at 4:34 PM, Sharma Podila <spod...@netflix.com> > wrote: > >> MESOS-4795 created. >> >> I don't have the exit status. We haven't seen a repeat yet, will catch >> the exit status next time it happens. >> >> Yes, removin

Re: mesos agent not recovering after ZK init failure

2016-02-26 Thread Sharma Podila
again. > > On Tue, Feb 23, 2016 at 2:51 PM, Sharma Podila <spod...@netflix.com> > wrote: > >> Hi Ben, >> >> Let me know if there is a new issue created for this, I would like to add >> myself to watch it. >> Thanks. >> >>

Re: mesos agent not recovering after ZK init failure

2016-02-23 Thread Sharma Podila
Hi Ben, Let me know if there is a new issue created for this, I would like to add myself to watch it. Thanks. On Wed, Feb 10, 2016 at 9:54 AM, Sharma Podila <spod...@netflix.com> wrote: > Hi Ben, > > That is accurate, with one additional line: > > -Agent running fine wit

Re: AW: Feature request: move in-flight containers w/o stopping them

2016-02-19 Thread Sharma Podila
Moving stateless services can be trivial or a non-problem, as others have suggested. Migrating stateful services becomes a function of migrating the state, including any network connections, etc. To think aloud, from a bit of past considerations in HPC-like systems, some systems relied upon the

Re: mesos agent not recovering after ZK init failure

2016-02-10 Thread Sharma Podila
inue flapping, but silent exit after printing the > detector.cpp:481 log line. > > Is this accurate? What is the exit code from the silent exit? > > On Tue, Feb 9, 2016 at 9:09 PM, Sharma Podila <spod...@netflix.com> wrote: > >> Maybe related, but, maybe different since

mesos agent not recovering after ZK init failure

2016-02-09 Thread Sharma Podila
We had a few mesos agents stuck in an unrecoverable state after a transient ZK init error. Is this a known problem? I wasn't able to find an existing jira item for this. We are on 0.24.1 at this time. Most agents were fine, except a handful. These handful of agents had their mesos-slave process

Re: mesos agent not recovering after ZK init failure

2016-02-09 Thread Sharma Podila
was fixed in 0.19.0 (set the fix version now). But I guess you > are saying it is somehow related but not exactly the same issue? > > On Tue, Feb 9, 2016 at 11:46 AM, Raúl Gutiérrez Segalés < > r...@itevenworks.net> wrote: > >> On 9 February 2016 at 11:04, Sharma Podila <

Re: Scheduling tasks based on dependancy

2015-10-06 Thread Sharma Podila
ot collects these > resource information by itself. > > Am I missing something here? > > Regards, > Pradeep > > On 5 October 2015 at 18:28, Sharma Podila <spod...@netflix.com> wrote: > >> Pradeep, >> >> We recently open sourced Fenzo <https://github

Re: Scheduling tasks based on dependancy

2015-10-06 Thread Sharma Podila
Pradeep, attributes show up as name value pairs in the offers. Custom attributes can also be used in Fenzo for assignment optimizations. For example, we set custom attributes for AWS EC2 ZONE names and ASG names. We use the ZONE name custom attribute to balance tasks of a job across zones via the

Re: Scheduling tasks based on dependancy

2015-10-05 Thread Sharma Podila
Pradeep, We recently open sourced Fenzo (wiki) to handle these scenarios. We add a custom attribute for network bandwidth to each agent's "mesos-slave" command line. And we have Fenzo assign resources to tasks based on

Re: Metric for tasks queued/waiting?

2015-09-23 Thread Sharma Podila
> That would allow for tunable fair-sharing based on DRF-principles. >>> >>> On Wed, Sep 23, 2015 at 10:59 AM haosdent <haosd...@gmail.com> wrote: >>> >>>> Feel free to open a story in JIRA if you think your ideas are awesome. >>>> :-) >>>>

Re: How to kill a task gracefully?

2015-09-22 Thread Sharma Podila
I believe this depends on the executor being used. A kill request to mesos driver from the framework scheduler is delivered to the executor. The kill request by itself is not a guarantee that the task will be killed, until honored by the executor. So, it is possible that the executor can be

Re: Setting maximum per-node resources in offers

2015-09-10 Thread Sharma Podila
FYI- If you are to use Fenzo in writing your framework, it has support for limiting overall resources used by tasks with the use of a "group name". That is, all tasks with a group name, say "userA", would be limited to using the resources specified in the limit for the group. For this to work, you
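To make the "group name" idea concrete, here is a minimal Python sketch of per-group resource limits. It is not Fenzo's API (Fenzo is a Java library); `ResourceLimit`, `GroupLimiter`, `try_assign`, and `release` are hypothetical names used only for illustration, and the sketch assumes a limit is expressed as CPUs and memory.

```python
from dataclasses import dataclass


@dataclass
class ResourceLimit:
    """Hypothetical limit (or running usage) for one group."""
    cpus: float
    memory_mb: float


class GroupLimiter:
    """Tracks usage per group name and rejects assignments over the limit."""

    def __init__(self, limits):
        self._limits = dict(limits)
        self._used = {name: ResourceLimit(0.0, 0.0) for name in self._limits}

    def try_assign(self, group, cpus, memory_mb):
        limit = self._limits.get(group)
        if limit is None:
            # Assumption: groups without a configured limit are not capped.
            return True
        used = self._used[group]
        if used.cpus + cpus > limit.cpus:
            return False
        if used.memory_mb + memory_mb > limit.memory_mb:
            return False
        # Record the usage only when the task fits within the group limit.
        used.cpus += cpus
        used.memory_mb += memory_mb
        return True

    def release(self, group, cpus, memory_mb):
        # Give resources back when a task of the group terminates.
        used = self._used.get(group)
        if used is not None:
            used.cpus = max(0.0, used.cpus - cpus)
            used.memory_mb = max(0.0, used.memory_mb - memory_mb)
```

A scheduler built this way would call `try_assign("userA", ...)` before launching each task of that group and `release(...)` on terminal task states.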

Re: MesosCon Seattle attendee introduction thread

2015-08-17 Thread Sharma Podila
Hello Everyone, I am Sharma Podila, senior software engineer at Netflix. It is exciting to be a part of MesosCon again this year. We developed a cloud native Mesos framework to run a mix of service, batch, and stream processing workloads. To which end we created a reusable plug-ins based

Re: Setting minimum offer size

2015-06-30 Thread Sharma Podila
. The alternative with one framework will of course work, but this implies having a general-purpose framework, that does some work that is better done by Mesos (which has more information and therefore can take better decisions). On Wed, Jun 24, 2015 at 11:54 PM, Sharma Podila spod...@netflix.com

Re: Cluster autoscaling in Spark+Mesos ?

2015-06-05 Thread Sharma Podila
packing logic of fenzo. - -- Ankur Chauhan On 04/06/2015 22:35, Sharma Podila wrote: We autoscale our Mesos cluster in EC2 from within our framework. Scaling up can be easy via watching demand vs. supply. However, scaling down requires bin packing the tasks tightly onto as few servers

Re: Cluster autoscaling in Spark+Mesos ?

2015-06-04 Thread Sharma Podila
We autoscale our Mesos cluster in EC2 from within our framework. Scaling up can be easy via watching demand vs. supply. However, scaling down requires bin packing the tasks tightly onto as few servers as possible. Do you have any specific ideas on how you would leverage Mantis/Mesos for Spark based

Re: [DISCUSS] Renaming Mesos Slave

2015-06-02 Thread Sharma Podila
My $0.02... The use of the word Worker is confusing. This entity has several responsibilities, including, maintaining connectivity to master, managing and monitoring the executors, sending status updates, and other future endeavors such as autonomously determining actions for resource

Re: Is launchTasks() with multiple offers limited to a single slave?

2015-03-19 Thread Sharma Podila
I will assume that you are not talking of the case that a task actually is being launched on multiple slaves, since a task can only be launched on one slave with existing concepts. Yes, that call is for one or more tasks on a single slave. That call (since 0.18, I believe) also takes multiple

Re: Mesos cluster auto scaling slaves

2015-02-27 Thread Sharma Podila
Hello Kenneth, There is a little bit of work needed in the framework to do autoscaling of the slave cluster. Theoretically, scaling up can be relatively easy by watching the utilization and adding nodes. However, in order to scale down, the framework must support two things - some kind of bin

Re: cluster wide init

2015-01-22 Thread Sharma Podila
exclusivity when you spin up an 11th slave marathon would start it there. On Thursday, 22 January 2015, Sharma Podila spod...@netflix.com wrote: Just a thought looking forward... Might be useful to define an init kind of feature in Mesos slaves. Configuration can be defined in Mesos master

Re: cluster wide init

2015-01-22 Thread Sharma Podila
Just a thought looking forward... Might be useful to define an init kind of feature in Mesos slaves. Configuration can be defined in Mesos master that lists services that must be run on all slaves. When slaves register, they get the list of services to run all the time. Updates to the

Re: Trying to debug an issue in mesos task tracking

2015-01-21 Thread Sharma Podila
Have you checked the mesos-slave and mesos-master logs for that task id? There should be logs in there for task state updates, including FINISHED. There can be specific cases where sometimes the task status is not reliably sent to your scheduler (due to mesos-master restarts, leader election

Re: implementing data locality via mesos resource offers

2015-01-16 Thread Sharma Podila
Using the attributes would be the simplest way, if the slave were to support dynamic updates of the attributes. The JIRA that Tim references would be nice! Otherwise one would have to resort to something like a wrapper script of the mesos-slave process that detects new data availability and

Re: implementing data locality via mesos resource offers

2015-01-16 Thread Sharma Podila
, Sharma Podila spod...@netflix.com wrote: Using the attributes would be the simplest way, if the slave were to support dynamic updates of the attributes. The JIRA that Tim references would be nice! Otherwise one would have to resort to something like a wrapper script of the mesos-slave process

Re: Question about External Containerizer

2014-12-03 Thread Sharma Podila
This may have to do with fine-grain Vs coarse-grain resource allocation. Things may be easier for you, Diptanu, if you are using one Docker container per task (sort of coarse grain). In that case, I believe there's no need to alter a running Docker container's resources. Instead, the resource

Re: Question about External Containerizer

2014-12-03 Thread Sharma Podila
On Dec 3, 2014, at 10:20, Sharma Podila spod...@netflix.com wrote: This may have to do with fine-grain Vs coarse-grain resource allocation. Things may be easier for you, Diptanu, if you are using one Docker container per task (sort of coarse grain). In that case, I believe there's no need

Re: A problem with resource offers

2014-11-07 Thread Sharma Podila
, Timothy Chen t...@mesosphere.io wrote: Hi Sharma, Can you try out the latest master and see if you can repro it? Tim Sent from my iPhone On Nov 6, 2014, at 7:41 PM, Sharma Podila spod...@netflix.com wrote: I am on 0.18 still. I think I found a bug. I wrote a simple program to repeat

Re: A problem with resource offers

2014-11-06 Thread Sharma Podila
once a disconnection occurs. The scheduler driver does not automatically rescind offers upon disconnection, so I'd recommend clearing all cached offers when your scheduler gets disconnected, to avoid the unnecessary TASK_LOST updates. On Thu, Nov 6, 2014 at 6:25 PM, Sharma Podila spod
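The advice to clear cached offers on disconnection can be sketched as follows. This is a minimal Python illustration, not the Mesos scheduler API: `OfferCache` and its method names are hypothetical stand-ins for a scheduler's offer, rescind, and disconnect callbacks.

```python
class OfferCache:
    """Hypothetical cache of resource offers held by a scheduler."""

    def __init__(self):
        self._offers = {}

    def resource_offers(self, offers):
        # Cache each (offer_id, offer) pair until it is used or rescinded.
        for offer_id, offer in offers:
            self._offers[offer_id] = offer

    def offer_rescinded(self, offer_id):
        # Drop a single offer that the master has withdrawn.
        self._offers.pop(offer_id, None)

    def disconnected(self):
        # Offers are not rescinded automatically on disconnection, so
        # everything cached before the disconnect is treated as stale.
        self._offers.clear()

    def usable_offers(self):
        return dict(self._offers)
```

Launching tasks only from `usable_offers()` means that, after a disconnect, the scheduler waits for fresh offers instead of using stale ones and receiving TASK_LOST.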

Re: Reconciliation Document

2014-11-03 Thread Sharma Podila
Inline... On Tue, Oct 21, 2014 at 12:52 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote: Inline. On Thu, Oct 16, 2014 at 7:43 PM, Sharma Podila spod...@netflix.com wrote: Response inline, below. On Thu, Oct 16, 2014 at 5:41 PM, Benjamin Mahler benjamin.mah...@gmail.com wrote

Re: Reconciliation Document

2014-10-15 Thread Sharma Podila
Looks like a good step forward. What is the reason for the algorithm having to call reconcile tasks multiple times after waiting some time in step 6? Shouldn't it be just once per (re)registration? Are there time bound guarantees within which a task update will be sent out after a reconcile

Re: Framework testing in Mesos

2014-10-14 Thread Sharma Podila
@Sharma #3 looks impressive and I hear the pain. Few questions: * Since you already have the state machine modeling, can't the scheduler actions also be modeled as state machine transitions? I suppose that is possible in theory. I am thinking that the scheduler state will have to be a

Re: Framework testing in Mesos

2014-10-12 Thread Sharma Podila
Trying to test the framework in an automated way, I tend to think of the framework in these parts: 1. Executor 2. Scheduler's interaction with Mesos and state persistence 3. Scheduler's task assignment of resources I will skip #1, you covered that already and it depends largely on the kind of

Re: Design Review: Maintenance Primitives

2014-08-27 Thread Sharma Podila
Nicely written doc. Here are a few thoughts: - There's some commonality between the existing offerRescinded() and the new inverseOffer(). Maybe consider having the same method name for them with differing signatures? I'd second Maxime's point about possibly renaming inverseOffer to something else -

Re: MesosCon attendee introduction thread

2014-08-15 Thread Sharma Podila
Hello Everyone, I work at Netflix. I came across Mesos 7 months ago. I am developing a Mesos framework/scheduler for cloud native reactive stream processing. Together with my colleague, Justin Becker, we are excited to talk about it at MesosCon. Previously, I did a fair bit of work on dynamic

Re: Exposing executor container

2014-08-12 Thread Sharma Podila
You may already know this, but, this does sound similar to http://www.mail-archive.com/user@mesos.apache.org/msg00885.html There was a possible (and partial) solution in using soft limits for memory for which a ticket was opened. On Tue, Aug 12, 2014 at 1:17 PM, Thomas Petr tp...@hubspot.com

Re: Task serialization per machine?

2014-06-30 Thread Sharma Podila
A likely scenario is that your executor is running the task synchronously inside the callback to launchTask(). If you make it instead run the task asynchronously (e.g., in a separate thread), that should resolve it. On Mon, Jun 30, 2014 at 12:48 PM, Asim linka...@gmail.com wrote: Hi, I want
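The suggestion above (do the work off the callback thread) can be sketched in Python. This is not the Mesos executor API: `SketchExecutor`, `launch_task`, and `demo` are hypothetical names, and the "task" is simply a callable.

```python
import threading
import time


class SketchExecutor:
    """Hypothetical executor whose launch callback returns immediately."""

    def __init__(self):
        self.workers = []
        self.finished = []
        self._lock = threading.Lock()

    def launch_task(self, task_id, work):
        # Do not run the task inside the callback; hand it to a thread so
        # the next launch callback is not blocked behind this task.
        worker = threading.Thread(target=self._run, args=(task_id, work))
        worker.start()
        self.workers.append(worker)

    def _run(self, task_id, work):
        work()
        with self._lock:
            self.finished.append(task_id)


def demo():
    executor = SketchExecutor()
    start = time.monotonic()
    for task_id in ("t1", "t2", "t3"):
        executor.launch_task(task_id, lambda: time.sleep(0.5))
    # Time spent inside the three callbacks, before any task has finished.
    callbacks_elapsed = time.monotonic() - start
    for worker in executor.workers:
        worker.join()
    return callbacks_elapsed, sorted(executor.finished)
```

With a synchronous callback the three launches would take about 1.5 seconds in total; here the callbacks return almost immediately and the tasks overlap.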

Re: cgroups memory isolation

2014-06-19 Thread Sharma Podila
Purely from a user expectation point of view, I am wondering if such an abuse (overuse?) of I/O bandwidth/rate should translate into I/O bandwidth getting throttled for the job instead of it manifesting into an OOM that results in a job kill. Such I/O overuse translating into memory overuse seems

Re: cgroups memory isolation

2014-06-19 Thread Sharma Podila
...@redhat.com wrote: Awesome response! inline below - -- *From: *Sharma Podila spod...@netflix.com *To: *user@mesos.apache.org *Cc: *Ian Downes ian.dow...@gmail.com, Eric Abbott eabb...@hubspot.com *Sent: *Thursday, June 19, 2014 11:54:34 AM *Subject: *Re

Dealing with run away task processes after executor terminates

2014-06-03 Thread Sharma Podila
When a framework executor terminates, Mesos sends TASK_LOST status updates for tasks that were running. However, if a task had processes that do not terminate when the executor dies, then we have a problem since Mesos considers the slave resources assigned to those tasks as released. Whereas, the

Re: Dealing with run away task processes after executor terminates

2014-06-03 Thread Sharma Podila
there is definitely a race here, mesos might release the resource to framework before the container is destroyed. We'll try to fix that really soon. I'll let Jie/Ian chime in regarding fixes/tickets. On Tue, Jun 3, 2014 at 4:25 PM, Sharma Podila spod...@netflix.com wrote: When a framework executor

Q on master state.json

2014-05-21 Thread Sharma Podila
I see that master/state.json has state information on frameworks, wherein it has a list of all completed_tasks. Each task seems to be about 500 bytes. Does the master have a list of all completed tasks for the framework? Thinking naively about it, does it mean that if I were to run, say, 100K

Re: Question on resource offers and framework failover

2014-05-16 Thread Sharma Podila
failover timeout). If so, I have no way of knowing that my previous tasks have been killed. Once the registration starts failing with failover option if failover timeout has passed, then, I think the reconciliation strategy will work fine. On Fri, May 16, 2014 at 1:16 PM, Sharma Podila spod

Re: Question on resource offers and framework failover

2014-05-16 Thread Sharma Podila
its intent before launching a task, then the set of tasks in the framework will always be a superset of the tasks in the Master/Slaves. On Wed, May 14, 2014 at 11:04 PM, Sharma Podila spod...@netflix.com wrote: TASK_LOST is a good thing. I expect to deal with it now and in the future. I

Re: Question on resource offers and framework failover

2014-05-15 Thread Sharma Podila
in providing more precise task states for various conditions. On Tue, May 13, 2014 at 10:10 AM, Sharma Podila spod...@netflix.com wrote: Thanks for confirming that, Adam. , but it would be a good Mesos FAQ topic. I was thinking it might be good to also add to doc in code, either

Re: Question on resource offers and framework failover

2014-05-13 Thread Sharma Podila
the framework failover timeout is exceeded. On Mon, May 12, 2014 at 5:38 PM, Sharma Podila spod...@netflix.com wrote: My understanding is that when a framework fails over (either new instance starts after previous one fails, or the same instance restarts), Mesos master would automatically cancel

Trying to get task reconciliation to work

2014-04-17 Thread Sharma Podila
Hello, I don't seem to have reconcileTasks() working for me and was wondering if I am either using it incorrectly or hitting a problem. Here's what's happening: 1. There's one Mesos (0.18) master, one slave, one framework, all running on Ubuntu 12.04 2. Mesos master and slave come up fine (using

Re: Trying to get task reconciliation to work

2014-04-17 Thread Sharma Podila
to ascertain the state of their tasks. On Thu, Apr 17, 2014 at 12:53 PM, Sharma Podila spod...@netflix.com wrote: Hello, I don't seem to have reconcileTasks() working for me and was wondering if I am either using it incorrectly or hitting a problem. Here's what's happening: 1. There's one

Re: Question on executors

2014-03-10 Thread Sharma Podila
Thank you for the confirmation and the pointer to the 1 sec sleep. Yes, I meant TASK_FINISHED. If you don't want to implement an Executor and your Task merely consists of forking an arbitrary process, you can use the built-in Command Executor. You can launch a task directly in this manner by