Hey Claudiu,

I spent some time trying to understand the logs you posted. What's strange to me is that in the very beginning, when frameworks 1 and 2 are registered, only one framework gets offers for a period of 9s. It's not clear why this happens. I even wrote a test (https://reviews.apache.org/r/22714/) to repro it but wasn't able to.
It would probably be helpful to add more logging to the DRF sorting comparator function to understand why frameworks are sorted the way they are when their shares are the same (0). My expectation is that after each allocation, the 'allocations' count for a framework should increase, causing the sort function to behave correctly. But that doesn't seem to be happening in your case.

I0604 22:12:43.715530 22270 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000
I0604 22:12:44.276062 22273 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:44.756918 22292 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000
I0604 22:12:45.794178 22276 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:46.841629 22291 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:47.884266 22262 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:48.926856 22268 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:49.966560 22280 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:51.007143 22267 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:52.047987 22280 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:53.089340 22291 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
I0604 22:12:54.130242 22263 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000

@vinodkone
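To make that concrete, here is a rough, self-contained sketch of the kind of logging being suggested for the DRF comparator. This is an illustration only, not the actual code in master/drf_sorter.hpp: the Client fields, the tie-breaking order, and the use of std::cout (instead of glog's LOG(INFO), which is what you would use inside Mesos) are all assumptions made for the example.

// Simplified, hypothetical DRF-style comparator with extra logging.
// Names and fields are illustrative; they are not copied from
// master/drf_sorter.hpp.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Client {
  std::string name;          // framework id
  double share;              // dominant share of this framework
  unsigned int allocations;  // how many times this client has been allocated to
};

// Sort by share; when shares are equal (e.g. both 0 right after
// registration), fall back to the allocation count so a framework that was
// just sent an offer drops behind one that was not.
struct DRFComparator {
  bool operator()(const Client& a, const Client& b) const {
    if (a.share != b.share) {
      return a.share < b.share;
    }
    if (a.allocations != b.allocations) {
      return a.allocations < b.allocations;
    }
    return a.name < b.name;
  }
};

void sortAndLog(std::vector<Client>& clients) {
  std::sort(clients.begin(), clients.end(), DRFComparator());

  // The extra logging: dump share and allocation count in sorted order
  // every time the sorter runs.
  for (const Client& c : clients) {
    std::cout << "client=" << c.name
              << " share=" << c.share
              << " allocations=" << c.allocations << std::endl;
  }
}

int main() {
  // Two frameworks with identical (zero) shares, as in the window right
  // after registration; only the allocation counter separates them.
  std::vector<Client> clients = {
    {"20140604-221214-302055434-5050-22260-0000", 0.0, 3},
    {"20140604-221214-302055434-5050-22260-0001", 0.0, 1},
  };

  // -0001 has been allocated to fewer times, so it should sort first and
  // receive the next offer.
  sortAndLog(clients);
  return 0;
}

If the real sorter logs the expected ordering but the same framework keeps receiving offers anyway, that would point at how the allocator consumes the sorted list (the HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate() path Claudiu mentions below) rather than at the comparator itself.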
On Fri, Jun 13, 2014 at 3:40 PM, Claudiu Barbura <[email protected]> wrote:

> Hi Vinod,
>
> Attached are the patch files. Hadoop has to be treated differently as it
> requires resources in order to shut down task trackers after a job is
> complete. Therefore we set the role name so that Mesos allocates resources
> for it first, ahead of the rest of the frameworks under the default role (*).
> This is not ideal; we are going to look into the Hadoop Mesos framework
> code and fix it if possible. Luckily, Hadoop is the only framework we use on
> top of Mesos that allows a configurable role name to be passed in when
> registering a framework (unlike Spark, Aurora, Storm etc.)
> For the non-Hadoop frameworks, we are making sure that once a framework is
> running its jobs, Mesos no longer offers resources to it. At the same time,
> once a framework completes its jobs, we make sure its “client allocations”
> value is updated so that it is placed back in the sorting list with a real
> chance of being offered again immediately (not starved!).
> What is also key is that mem type resources are ignored during share
> computation, as only cpus are a good indicator of which frameworks are
> actually running jobs in the cluster.
>
> Thanks,
> Claudiu
>
> From: Claudiu Barbura <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, June 12, 2014 at 6:20 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Framework Starvation
>
> Hi Vinod,
>
> We have a fix (more like a hack) that works for us, but it requires us
> to run each Hadoop framework with a different role, as we need to treat
> Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora),
> which are running with the default role (*).
> We had to change the drf_sorter.cpp/hpp and
> hierarchical_allocator_process.cpp files.
>
> Let me know if you need more info on this.
>
> Thanks,
> Claudiu
>
> From: Claudiu Barbura <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Thursday, June 5, 2014 at 2:41 AM
> To: "[email protected]" <[email protected]>
> Subject: Re: Framework Starvation
>
> Hi Vinod,
>
> I attached the master log after adding more logging to the sorter code.
> I believe the problem lies somewhere else, however …
> in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate()
>
> I will continue to investigate in the meantime.
>
> Thanks,
> Claudiu
>
> From: Vinod Kone <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, June 3, 2014 at 5:16 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Framework Starvation
>
> Either should be fine. I don't think there are any changes in the allocator
> since 0.18.0-rc1.
>
>
> On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <[email protected]> wrote:
>
>> Hi Vinod,
>>
>> Should we use the same 0-18.1-rc1 branch or trunk code?
>>
>> Thanks,
>> Claudiu
>>
>> From: Vinod Kone <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Tuesday, June 3, 2014 at 3:55 PM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Framework Starvation
>>
>> Hey Claudiu,
>>
>> Is it possible for you to run the same test but log more
>> information about the framework shares? For example, it would be really
>> insightful if you can log each framework's share in DRFSorter::sort()
>> (see: master/drf_sorter.hpp). This will help us diagnose the problem. I
>> suspect one of our open tickets around allocation (MESOS-1119
>> <https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130
>> <https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187
>> <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it
>> would be good to have that logging data regardless, to confirm.
>>
>>
>> On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <[email protected]> wrote:
>>
>>> Hi Vinod,
>>>
>>> I tried to attach the logs (2MB) and the email (see below) did not go
>>> through. I emailed your gmail account separately.
>>>
>>> Thanks,
>>> Claudiu
>>>
>>> From: Claudiu Barbura <[email protected]>
>>> Date: Monday, June 2, 2014 at 10:00 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Framework Starvation
>>>
>>> Hi Vinod,
>>>
>>> I attached the master log snapshots during starvation and after
>>> starvation.
>>>
>>> There are 4 slave nodes and 1 master, all of the same ec2
>>> instance type (cc2.8xlarge, 32 cores, 60GB RAM).
>>> I am running 4 shark-cli instances from the same master node, and
>>> running queries on all 4 of them … then “starvation” kicks in (see attached
>>> log_during_starvation file).
>>> After I terminate 2 of the shark-cli instances, the starved ones are
>>> receiving offers and are able to run queries again (see attached
>>> log_after_starvation file).
>>>
>>> Let me know if you need the slave logs.
>>>
>>> Thank you!
>>> Claudiu
>>>
>>> From: Vinod Kone <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Friday, May 30, 2014 at 10:13 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Framework Starvation
>>>
>>> Hey Claudiu,
>>>
>>> Mind posting some master logs with the simple setup that you described
>>> (3 shark cli instances)? That would help us better diagnose the problem.
>>>
>>>
>>> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <[email protected]> wrote:
>>>
>>>> This is a critical issue for us, as we have to shut down frameworks
>>>> for various components in our platform to work, and this has created more
>>>> contention than before we deployed Mesos, when everyone had to wait in line
>>>> for their MR/Hive jobs to run.
>>>>
>>>> Any guidance or ideas would be extremely helpful at this point.
>>>>
>>>> Thank you,
>>>> Claudiu
>>>>
>>>> From: Claudiu Barbura <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Tuesday, May 27, 2014 at 11:57 PM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Framework Starvation
>>>>
>>>> Hi,
>>>>
>>>> Following Ben’s suggestion at the Seattle Spark Meetup in April, I
>>>> built and deployed the 0-18.1-rc1 branch hoping that this would solve the
>>>> framework starvation problem we have been seeing for the past 2 months now.
>>>> The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would
>>>> also help us. Unfortunately it did not.
>>>> This bug is preventing us from running multiple spark and shark servers
>>>> (http, thrift), in load balanced fashion, Hadoop and Aurora in the same
>>>> mesos cluster.
>>>>
>>>> For example, if we start at least 3 frameworks, one Hadoop, one
>>>> SparkJobServer (one Spark context in fine-grained mode) and one Http
>>>> SharkServer (one JavaSharkContext that inherits from Spark Contexts, again
>>>> in fine-grained mode) and we run queries on all three of them, very soon we
>>>> notice the following behavior:
>>>>
>>>>    - only the last two frameworks that we run queries against receive
>>>>    resource offers (master.cpp log entries in the log/mesos-master.INFO)
>>>>    - the other frameworks are ignored and not allocated any resources
>>>>    until we kill one of the two privileged ones above
>>>>    - As soon as one of the privileged frameworks is terminated, one of
>>>>    the starved frameworks takes its place
>>>>    - Any new Spark context created in coarse-grained mode (fixed
>>>>    number of cores) will generally receive offers immediately (rarely it
>>>>    gets starved)
>>>>    - Hadoop behaves slightly differently when starved: task trackers
>>>>    are started but never released, which means, if the first job (Hive query)
>>>>    is small in terms of number of input splits, only one task tracker with a
>>>>    small number of allocated cores is created, and then all subsequent queries,
>>>>    regardless of size, are only run in very limited mode with this one “small”
>>>>    task tracker. Most of the time only the map phase of a big query is
>>>>    completed while the reduce phase is hanging.
>>>>    Killing one of the registered
>>>>    Spark contexts above releases resources for Mesos to complete the query and
>>>>    gracefully shut down the task trackers (as noticed in the master log).
>>>>
>>>> We are using the default settings in terms of isolation, weights etc …
>>>> the only stand-out configuration would be the memory allocation for the slave
>>>> (export MESOS_resources=mem:35840 in mesos-slave-env.sh), but I am not sure
>>>> if this is ever enforced, as each framework has its own executor process
>>>> (JVM in our case) with its own memory allocation (we are not using cgroups
>>>> yet).
>>>>
>>>> A very easy way to reproduce this bug is to start a minimum of 3
>>>> shark-cli instances in a mesos cluster and notice that only two of them are
>>>> being offered resources and are running queries successfully.
>>>> I spent quite a bit of time in mesos, spark and hadoop-mesos code in an
>>>> attempt to find a possible workaround, but no luck so far.
>>>>
>>>> Any guidance would be very appreciated.
>>>>
>>>> Thank you,
>>>> Claudiu
>>>>
>>>>
>>>
>>
>
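A closing note on the share computation change described in the June 13 message at the top of the quoted thread: ignoring mem and counting only cpus amounts, conceptually, to something like the sketch below. This is only an illustration of the idea, not the actual Mesos sorter code or the patch itself; the struct names, the example framework, and the 80000 MB figure are invented, and the cluster capacities are taken loosely from the cc2.8xlarge setup described above.

// Conceptual sketch of a framework's share with and without mem.
// Not the actual Mesos sorter code or the patch discussed above; names
// and numbers are made up for illustration.
#include <algorithm>
#include <iostream>

struct Allocation {
  double cpus;
  double mem;  // MB
};

struct Totals {
  double cpus;
  double mem;  // MB
};

// Standard DRF-style dominant share: the max over resource types.
double dominantShare(const Allocation& a, const Totals& t) {
  return std::max(a.cpus / t.cpus, a.mem / t.mem);
}

// Variant described in the thread: only cpus are counted, on the theory
// that cpu usage better reflects which frameworks are actually running jobs.
double cpusOnlyShare(const Allocation& a, const Totals& t) {
  return a.cpus / t.cpus;
}

int main() {
  // 4 slaves of 32 cores / 35840 MB each (loosely the cc2.8xlarge setup
  // and MESOS_resources value mentioned earlier in the thread).
  Totals cluster{4 * 32.0, 4 * 35840.0};

  // A framework holding a lot of memory but almost no cpus (e.g. idle
  // between queries).
  Allocation idleButMemoryHeavy{2.0, 80000.0};

  std::cout << "dominant share:  " << dominantShare(idleButMemoryHeavy, cluster) << "\n"
            << "cpus-only share: " << cpusOnlyShare(idleButMemoryHeavy, cluster) << "\n";
  // The dominant share (~0.56) is driven entirely by mem and pushes the
  // framework to the back of the allocation order; the cpus-only share
  // (~0.016) keeps it near the front, so it is offered resources again.
  return 0;
}

The rationale in the thread is that only cpus are a good indicator of which frameworks are actually running jobs, so a framework that is holding memory while running nothing is not pushed to the back of the sort order; whether that is the right long-term fix for the allocator is a separate question.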

