On Thu, Jun 19, 2014 at 10:46 AM, Vinod Kone <[email protected]> wrote:
> Waiting to see your blog post :)
>
> That said, what baffles me is that in the very beginning, when only two frameworks are present and no tasks have been launched, one framework is getting more allocations than the other (see the log lines I posted in the earlier email), which is unexpected.
>
>
> @vinodkone
>
>
> On Tue, Jun 17, 2014 at 9:41 PM, Claudiu Barbura <[email protected]> wrote:
>
>> Hi Vinod,
>>
>> You are looking at logs I had posted before we implemented our fix (files attached in my last email).
>> I will write a detailed blog post on the issue … after the Spark Summit at the end of this month.
>>
>> What would happen before is that frameworks with the same share (0) would also have the smallest allocation in the beginning, and after sorting the list they would be at the top, always offered all the resources before other frameworks that had already been offered resources and were running tasks with a share and allocation > 0.
>>
>> Thanks,
>> Claudiu
>>
>> From: Vinod Kone <[email protected]>
>> Reply-To: "[email protected]" <[email protected]>
>> Date: Wednesday, June 18, 2014 at 4:54 AM
>> To: "[email protected]" <[email protected]>
>> Subject: Re: Framework Starvation
>>
>> Hey Claudiu,
>>
>> I spent some time trying to understand the logs you posted. What's strange to me is that in the very beginning, when frameworks 1 and 2 are registered, only one framework gets offers for a period of 9s. It's not clear why this happens. I even wrote a test (https://reviews.apache.org/r/22714/) to repro but wasn't able to.
>>
>> It would probably be helpful to add more logging to the DRF sorting comparator function to understand why frameworks are sorted in such a way when their share is the same (0). My expectation is that after each allocation, the 'allocations' count for a framework should increase, causing the sort function to behave correctly. But that doesn't seem to be happening in your case.
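[Editor's note: for reference, a minimal standalone sketch of the tie-breaking behavior being discussed, as a simplified stand-in for the comparator in master/drf_sorter.hpp rather than the actual Mesos code. The Client fields below are assumptions for illustration; the point is that with equal shares, the client with fewer past allocations should sort first, so if 'allocations' never increases after an offer, the same framework keeps winning the tie.]

#include <cstdint>
#include <string>

// Simplified per-framework entry, standing in for the sorter's client state.
struct Client
{
  std::string name;
  double share;          // dominant share (0 for frameworks with no tasks)
  uint64_t allocations;  // how many times this client has been allocated to
};

// DRF-style comparator: lower dominant share sorts first; on a tie (e.g. both
// shares are 0), the client that has received fewer allocations so far comes
// first. If 'allocations' is not bumped after each allocation, equal-share
// frameworks never swap places and the others starve.
struct DRFComparator
{
  bool operator()(const Client& a, const Client& b) const
  {
    if (a.share == b.share) {
      return a.allocations < b.allocations;
    }
    return a.share < b.share;
  }
};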
>>
>> I0604 22:12:43.715530 22270 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000
>> I0604 22:12:44.276062 22273 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:44.756918 22292 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000
>> I0604 22:12:45.794178 22276 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:46.841629 22291 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:47.884266 22262 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:48.926856 22268 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:49.966560 22280 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:51.007143 22267 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:52.047987 22280 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:53.089340 22291 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0001
>> I0604 22:12:54.130242 22263 master.cpp:2282] Sending 4 offers to framework 20140604-221214-302055434-5050-22260-0000
>>
>>
>> @vinodkone
>>
>>
>> On Fri, Jun 13, 2014 at 3:40 PM, Claudiu Barbura <[email protected]> wrote:
>>
>>> Hi Vinod,
>>>
>>> Attached are the patch files. Hadoop has to be treated differently, as it requires resources in order to shut down task trackers after a job is complete. Therefore we set the role name so that Mesos allocates resources for it first, ahead of the rest of the frameworks under the default role (*).
>>> This is not ideal; we are going to look into the Hadoop Mesos framework code and fix it if possible. Luckily, Hadoop is the only framework we use on top of Mesos that allows a configurable role name to be passed in when registering a framework (unlike Spark, Aurora, Storm, etc.).
>>> For the non-Hadoop frameworks, we are making sure that once a framework is running its jobs, Mesos no longer offers resources to it. At the same time, once a framework completes its jobs, we make sure its “client allocations” value is updated so that it is placed back in the sorting list with a real chance of being offered resources again immediately (not starved!).
>>> What is also key is that mem-type resources are ignored during share computation, as only cpus are a good indicator of which frameworks are actually running jobs in the cluster.
>>>
>>> Thanks,
>>> Claudiu
>>>
>>> From: Claudiu Barbura <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Thursday, June 12, 2014 at 6:20 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Framework Starvation
>>>
>>> Hi Vinod,
>>>
>>> We have a fix (more like a hack) that works for us, but it requires us to run each Hadoop framework with a different role, as we need to treat Hadoop differently than the rest of the frameworks (Spark, Shark, Aurora), which are running with the default role (*).
>>> We had to change the drf_sorter.cpp/hpp and hierarchical_allocator_process.cpp files.
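[Editor's note: a rough, hypothetical illustration of one of the changes described above (computing a framework's share from cpus only, ignoring mem). This is not the actual patch to drf_sorter.cpp; it assumes resources are represented as simple name-to-quantity maps and is only meant to show the idea.]

#include <algorithm>
#include <map>
#include <string>

// Hypothetical share computation: the dominant share is taken over cpus (and
// any other resource) but mem is skipped, so a framework that holds memory
// while running no tasks does not look "busy" to the sorter.
double calculateShare(const std::map<std::string, double>& total,      // cluster totals, e.g. {"cpus": 128, "mem": 143360}
                      const std::map<std::string, double>& allocated)  // resources currently allocated to one framework
{
  double share = 0.0;

  for (const auto& entry : allocated) {
    const std::string& name = entry.first;

    if (name == "mem") {
      continue;  // ignore mem: only cpus indicate which frameworks run jobs
    }

    const auto it = total.find(name);
    if (it != total.end() && it->second > 0.0) {
      share = std::max(share, entry.second / it->second);
    }
  }

  return share;
}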
>>>
>>> Let me know if you need more info on this.
>>>
>>> Thanks,
>>> Claudiu
>>>
>>> From: Claudiu Barbura <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Thursday, June 5, 2014 at 2:41 AM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Framework Starvation
>>>
>>> Hi Vinod,
>>>
>>> I attached the master log after adding more logging to the sorter code. I believe the problem lies somewhere else, however … in HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate().
>>>
>>> I will continue to investigate in the meantime.
>>>
>>> Thanks,
>>> Claudiu
>>>
>>> From: Vinod Kone <[email protected]>
>>> Reply-To: "[email protected]" <[email protected]>
>>> Date: Tuesday, June 3, 2014 at 5:16 PM
>>> To: "[email protected]" <[email protected]>
>>> Subject: Re: Framework Starvation
>>>
>>> Either should be fine. I don't think there have been any changes in the allocator since 0.18.0-rc1.
>>>
>>>
>>> On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <[email protected]> wrote:
>>>
>>>> Hi Vinod,
>>>>
>>>> Should we use the same 0.18.1-rc1 branch or trunk code?
>>>>
>>>> Thanks,
>>>> Claudiu
>>>>
>>>> From: Vinod Kone <[email protected]>
>>>> Reply-To: "[email protected]" <[email protected]>
>>>> Date: Tuesday, June 3, 2014 at 3:55 PM
>>>> To: "[email protected]" <[email protected]>
>>>> Subject: Re: Framework Starvation
>>>>
>>>> Hey Claudiu,
>>>>
>>>> Is it possible for you to run the same test but log more information about the framework shares? For example, it would be really insightful if you could log each framework's share in DRFSorter::sort() (see: master/drf_sorter.hpp). This will help us diagnose the problem. I suspect one of our open tickets around allocation (MESOS-1119 <https://issues.apache.org/jira/browse/MESOS-1119>, MESOS-1130 <https://issues.apache.org/jira/browse/MESOS-1130> and MESOS-1187 <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But it would be good to have that logging data regardless, to confirm.
>>>>
>>>>
>>>> On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <[email protected]> wrote:
>>>>
>>>>> Hi Vinod,
>>>>>
>>>>> I tried to attach the logs (2MB) and the email (see below) did not go through. I emailed your gmail account separately.
>>>>>
>>>>> Thanks,
>>>>> Claudiu
>>>>>
>>>>> From: Claudiu Barbura <[email protected]>
>>>>> Date: Monday, June 2, 2014 at 10:00 AM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: Framework Starvation
>>>>>
>>>>> Hi Vinod,
>>>>>
>>>>> I attached the master log snapshots during starvation and after starvation.
>>>>>
>>>>> There are 4 slave nodes and 1 master, all of the same ec2 instance type (cc2.8xlarge, 32 cores, 60GB RAM).
>>>>> I am running 4 shark-cli instances from the same master node, and running queries on all 4 of them … then “starvation” kicks in (see the attached log_during_starvation file).
>>>>> After I terminate 2 of the shark-cli instances, the starved ones start receiving offers and are able to run queries again (see the attached log_after_starvation file).
>>>>>
>>>>> Let me know if you need the slave logs.
>>>>>
>>>>> Thank you!
>>>>> Claudiu
>>>>>
>>>>> From: Vinod Kone <[email protected]>
>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>> Date: Friday, May 30, 2014 at 10:13 AM
>>>>> To: "[email protected]" <[email protected]>
>>>>> Subject: Re: Framework Starvation
>>>>>
>>>>> Hey Claudiu,
>>>>>
>>>>> Mind posting some master logs with the simple setup that you described (3 shark-cli instances)? That would help us better diagnose the problem.
>>>>>
>>>>>
>>>>> On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <[email protected]> wrote:
>>>>>
>>>>>> This is a critical issue for us, as we have to shut down frameworks for various components in our platform to work, and this has created more contention than before we deployed Mesos, when everyone had to wait in line for their MR/Hive jobs to run.
>>>>>>
>>>>>> Any guidance or ideas would be extremely helpful at this point.
>>>>>>
>>>>>> Thank you,
>>>>>> Claudiu
>>>>>>
>>>>>> From: Claudiu Barbura <[email protected]>
>>>>>> Reply-To: "[email protected]" <[email protected]>
>>>>>> Date: Tuesday, May 27, 2014 at 11:57 PM
>>>>>> To: "[email protected]" <[email protected]>
>>>>>> Subject: Framework Starvation
>>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> Following Ben’s suggestion at the Seattle Spark Meetup in April, I built and deployed the 0.18.1-rc1 branch hoping that it would solve the framework starvation problem we have been seeing for the past 2 months now. The hope was that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. Unfortunately it did not.
>>>>>> This bug is preventing us from running multiple Spark and Shark servers (http, thrift) in a load-balanced fashion, along with Hadoop and Aurora, in the same Mesos cluster.
>>>>>>
>>>>>> For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer (one Spark context in fine-grained mode) and one HTTP SharkServer (one JavaSharkContext, which inherits from SparkContext, again in fine-grained mode), and we run queries on all three of them, very soon we notice the following behavior:
>>>>>>
>>>>>> - only the last two frameworks that we run queries against receive resource offers (master.cpp log entries in log/mesos-master.INFO)
>>>>>> - the other frameworks are ignored and not allocated any resources until we kill one of the two privileged ones above
>>>>>> - as soon as one of the privileged frameworks is terminated, one of the starved frameworks takes its place
>>>>>> - any new Spark context created in coarse-grained mode (fixed number of cores) will generally receive offers immediately (rarely it gets starved)
>>>>>> - Hadoop behaves slightly differently when starved: task trackers are started but never released, which means that if the first job (Hive query) is small in terms of number of input splits, only one task tracker with a small number of allocated cores is created, and then all subsequent queries, regardless of size, are run only in very limited mode with this one “small” task tracker. Most of the time only the map phase of a big query is completed while the reduce phase hangs.
>>>>>> Killing one of the registered Spark contexts above releases resources for Mesos to complete the query and gracefully shut down the task trackers (as noticed in the master log).
>>>>>>
>>>>>> We are using the default settings in terms of isolation, weights, etc. … the only stand-out configuration would be the memory allocation for the slave (export MESOS_resources=mem:35840 in mesos-slave-env.sh), but I am not sure if this is ever enforced, as each framework has its own executor process (a JVM in our case) with its own memory allocation (we are not using cgroups yet).
>>>>>>
>>>>>> A very easy way to reproduce this bug is to start a minimum of 3 shark-cli instances in a Mesos cluster and notice that only two of them are being offered resources and are running queries successfully.
>>>>>> I spent quite a bit of time in the mesos, spark and hadoop-mesos code in an attempt to find a possible workaround, but no luck so far.
>>>>>>
>>>>>> Any guidance would be very much appreciated.
>>>>>>
>>>>>> Thank you,
>>>>>> Claudiu

