Either should be fine. I don't think there have been any changes in the allocator since 0.18.0-rc1.
On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <[email protected]> wrote:

> Hi Vinod,
>
> Should we use the same 0.18.1-rc1 branch or trunk code?
>
> Thanks,
> Claudiu
>
> From: Vinod Kone <[email protected]>
> Reply-To: "[email protected]" <[email protected]>
> Date: Tuesday, June 3, 2014 at 3:55 PM
> To: "[email protected]" <[email protected]>
> Subject: Re: Framework Starvation
>
> Hey Claudiu,
>
> Is it possible for you to run the same test but log more information
> about the framework shares? For example, it would be really insightful
> if you could log each framework's share in DRFSorter::sort() (see:
> master/drf_sorter.hpp). This will help us diagnose the problem. I
> suspect one of our open tickets around allocation (MESOS-1119
> <https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130
> <https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187
> <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But
> it would be good to have that logging data regardless, to confirm.
>
>
> On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <
> [email protected]> wrote:
>
>> Hi Vinod,
>>
>> I tried to attach the logs (2MB) and the email (see below) did not go
>> through. I emailed your gmail account separately.
>>
>> Thanks,
>> Claudiu
>>
>> From: Claudiu Barbura <[email protected]>
>> Date: Monday, June 2, 2014 at 10:00 AM
>>
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Framework Starvation
>>
>> Hi Vinod,
>>
>> I attached the master log snapshots taken during starvation and after
>> starvation.
>>
>> There are 4 slave nodes and 1 master, all of the same EC2 instance
>> type (cc2.8xlarge, 32 cores, 60GB RAM).
>> I am running 4 shark-cli instances from the same master node, and
>> running queries on all 4 of them … then “starvation” kicks in (see
>> attached log_during_starvation file).
>> After I terminate 2 of the shark-cli instances, the starved ones
>> start receiving offers and are able to run queries again (see
>> attached log_after_starvation file).
>>
>> Let me know if you need the slave logs.
>>
>> Thank you!
>> Claudiu
>>
>> From: Vinod Kone <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Friday, May 30, 2014 at 10:13 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Framework Starvation
>>
>> Hey Claudiu,
>>
>> Mind posting some master logs with the simple setup that you
>> described (3 shark-cli instances)? That would help us better diagnose
>> the problem.
>>
>>
>> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <
>> [email protected]> wrote:
>>
>>> This is a critical issue for us, as we have to shut down frameworks
>>> for various components in our platform to work, and this has created
>>> more contention than before we deployed Mesos, when everyone had to
>>> wait in line for their MR/Hive jobs to run.
>>>
>>> Any guidance or ideas would be extremely helpful at this point.
>>>
>>> Thank you,
>>> Claudiu
>>>
>>> From: Claudiu Barbura <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Tuesday, May 27, 2014 at 11:57 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Framework Starvation
>>>
>>> Hi,
>>>
>>> Following Ben’s suggestion at the Seattle Spark Meetup in April, I
>>> built and deployed the 0.18.1-rc1 branch, hoping that this would
>>> solve the framework starvation problem we have been seeing for the
>>> past 2 months. The hope was that
>>> https://issues.apache.org/jira/browse/MESOS-1086 would also help us.
>>> Unfortunately it did not.
>>> This bug is preventing us from running multiple Spark and Shark
>>> servers (HTTP, Thrift) in a load-balanced fashion, alongside Hadoop
>>> and Aurora, in the same Mesos cluster.
>>>
>>> For example, if we start at least 3 frameworks, one Hadoop, one
>>> SparkJobServer (one Spark context in fine-grained mode) and one HTTP
>>> SharkServer (one JavaSharkContext, which inherits from the Spark
>>> context, again in fine-grained mode) and we run queries on all three
>>> of them, very soon we notice the following behavior:
>>>
>>>
>>>    - only the last two frameworks that we run queries against
>>>    receive resource offers (master.cpp log entries in
>>>    log/mesos-master.INFO)
>>>    - the other frameworks are ignored and not allocated any
>>>    resources until we kill one of the two privileged ones above
>>>    - as soon as one of the privileged frameworks is terminated, one
>>>    of the starved frameworks takes its place
>>>    - any new Spark context created in coarse-grained mode (fixed
>>>    number of cores) will generally receive offers immediately (it is
>>>    rarely starved)
>>>    - Hadoop behaves slightly differently when starved: task trackers
>>>    are started but never released, which means that if the first job
>>>    (Hive query) is small in terms of number of input splits, only
>>>    one task tracker with a small number of allocated cores is
>>>    created, and then all subsequent queries, regardless of size, run
>>>    in a very limited mode with this one “small” task tracker. Most
>>>    of the time only the map phase of a big query completes while the
>>>    reduce phase hangs. Killing one of the registered Spark contexts
>>>    above releases resources for Mesos to complete the query and
>>>    gracefully shut down the task trackers (as noted in the master
>>>    log)
>>>
>>> We are using the default settings in terms of isolation, weights,
>>> etc. The only stand-out configuration would be the memory allocation
>>> for the slave (export MESOS_resources=mem:35840 in
>>> mesos-slave-env.sh), but I am not sure this is ever enforced, as
>>> each framework has its own executor process (a JVM in our case) with
>>> its own memory allocation (we are not using cgroups yet).
>>>
>>> A very easy way to reproduce this bug is to start a minimum of 3
>>> shark-cli instances in a Mesos cluster and notice that only two of
>>> them are offered resources and can run queries successfully.
>>> I spent quite a bit of time in the Mesos, Spark and hadoop-mesos
>>> code in an attempt to find a possible workaround, but no luck so
>>> far.
>>>
>>> Any guidance would be very appreciated.
>>>
>>> Thank you,
>>> Claudiu
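For reference, the kind of per-sort share logging Vinod suggests above can be illustrated with a standalone sketch. This is not Mesos source code: the Framework struct, the resource totals and the dominantShare() helper below are hypothetical stand-ins for what DRFSorter::sort() (master/drf_sorter.hpp) computes internally; in the real code the equivalent change would simply be an extra log line per client inside sort().

// Standalone illustration (not Mesos source): compute and log each
// framework's dominant share the way the DRF sorter ranks clients,
// to show the kind of per-sort output being asked for above.
#include <algorithm>
#include <iostream>
#include <map>
#include <string>
#include <vector>

struct Framework {
  std::string name;
  std::map<std::string, double> allocation;  // resource -> amount allocated
};

// Dominant share: max over resources of (allocated / cluster total).
double dominantShare(const Framework& f,
                     const std::map<std::string, double>& totals) {
  double share = 0.0;
  for (const auto& r : f.allocation) {
    auto it = totals.find(r.first);
    if (it != totals.end() && it->second > 0.0) {
      share = std::max(share, r.second / it->second);
    }
  }
  return share;
}

int main() {
  // Hypothetical cluster matching the thread: 4 slaves x (32 cpus, 35840 MB).
  std::map<std::string, double> totals = {{"cpus", 128.0}, {"mem", 143360.0}};

  // Hypothetical allocations for the three frameworks in the example.
  std::vector<Framework> frameworks = {
      {"hadoop",          {{"cpus", 8.0},  {"mem", 16384.0}}},
      {"spark-jobserver", {{"cpus", 64.0}, {"mem", 65536.0}}},
      {"shark-http",      {{"cpus", 48.0}, {"mem", 49152.0}}},
  };

  // Sort ascending by dominant share; DRF offers to the lowest share first.
  std::sort(frameworks.begin(), frameworks.end(),
            [&](const Framework& a, const Framework& b) {
              return dominantShare(a, totals) < dominantShare(b, totals);
            });

  // The per-sort log line that would make starvation visible: a starved
  // framework showing a low share here yet still receiving no offers.
  for (const auto& f : frameworks) {
    std::cout << "framework=" << f.name
              << " dominant_share=" << dominantShare(f, totals) << std::endl;
  }
  return 0;
}

Logged this way on every allocation cycle, a framework that keeps a near-zero dominant share while never appearing in offers would point toward the open allocator tickets Vinod lists (MESOS-1119, MESOS-1130, MESOS-1187).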

