Hi Vinod,

You are looking at logs I had posted before we implemented our fix (the files 
attached to my last email).
I will write a detailed blog post on the issue … after the Spark Summit at the 
end of this month.

What would happen before our fix is that frameworks with the same share (0) would 
also have the smallest allocation count in the beginning, so after sorting the list 
they would end up at the top and always be offered all the resources, ahead of the 
other frameworks that had already received offers and were running tasks with a 
share and allocation > 0.
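
For anyone following along, here is a minimal, hypothetical sketch of the 
tie-breaking behavior described above (the Client fields and comparator shape are 
assumptions loosely modeled on drf_sorter.hpp, not the actual Mesos source):

    // Hypothetical illustration of DRF tie-breaking; field and type names
    // are assumptions, not the real Mesos 0.18 code.
    #include <set>
    #include <string>

    struct Client {
      std::string name;
      double share;          // dominant share; 0 for frameworks with nothing allocated
      unsigned allocations;  // how many times this client has been allocated to
    };

    struct DRFComparator {
      bool operator()(const Client& a, const Client& b) const {
        if (a.share == b.share) {
          // Tie-break on allocation count. If this counter is never bumped
          // after an offer, a zero-share framework stays at the front of the
          // sorted set and keeps receiving every offer, starving the rest.
          return a.allocations < b.allocations;
        }
        return a.share < b.share;
      }
    };

    // Clients are kept ordered by the comparator above.
    typedef std::set<Client, DRFComparator> ClientSet;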

Thanks,
Claudiu

From: Vinod Kone <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Wednesday, June 18, 2014 at 4:54 AM
To: "[email protected]" <[email protected]>
Subject: Re: Framework Starvation

Hey Claudiu,

I spent some time trying to understand the logs you posted. What's strange to me 
is that in the very beginning, when frameworks 1 and 2 are registered, only one 
framework gets offers for a period of 9s. It's not clear why this happens. I 
even wrote a test (https://reviews.apache.org/r/22714/) to reproduce it but wasn't 
able to.

It would probably be helpful to add more logging to the DRF sorting comparator 
function to understand why frameworks are sorted in such a way when their share 
is the same (0). My expectation is that after each allocation, the 'allocations' 
count for a framework should increase, causing the sort function to behave 
correctly. But that doesn't seem to be happening in your case.
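
As a rough idea of the kind of comparator logging meant here (a sketch only; the 
struct fields and glog usage are assumptions, not a patch against the real 
drf_sorter.cpp):

    // Hypothetical logging inside the DRF comparator so the master log shows
    // how ties at share == 0 are broken; names are assumptions.
    #include <glog/logging.h>
    #include <string>

    struct Client { std::string name; double share; unsigned allocations; };

    struct DRFComparator {
      bool operator()(const Client& a, const Client& b) const {
        LOG(INFO) << "DRF compare: " << a.name
                  << " (share=" << a.share << ", allocations=" << a.allocations << ")"
                  << " vs " << b.name
                  << " (share=" << b.share << ", allocations=" << b.allocations << ")";

        if (a.share == b.share) {
          return a.allocations < b.allocations;
        }
        return a.share < b.share;
      }
    };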



I0604 22:12:43.715530 22270 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0000

I0604 22:12:44.276062 22273 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:44.756918 22292 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0000

I0604 22:12:45.794178 22276 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:46.841629 22291 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:47.884266 22262 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:48.926856 22268 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:49.966560 22280 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:51.007143 22267 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:52.047987 22280 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:53.089340 22291 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0001

I0604 22:12:54.130242 22263 master.cpp:2282] Sending 4 offers to framework 
20140604-221214-302055434-5050-22260-0000


@vinodkone


On Fri, Jun 13, 2014 at 3:40 PM, Claudiu Barbura <[email protected]> wrote:
Hi Vinod,

Attached are the patch files. Hadoop has to be treated differently, as it 
requires resources in order to shut down task trackers after a job is complete. 
Therefore we set the role name so that Mesos allocates resources for it first, 
ahead of the rest of the frameworks under the default role (*).
This is not ideal; we are going to look into the Hadoop Mesos framework code and 
fix it if possible. Luckily, Hadoop is the only framework we use on top of Mesos 
that allows a configurable role name to be passed in when registering a 
framework (unlike Spark, Aurora, Storm, etc.).
For the non-Hadoop frameworks, we make sure that once a framework is running its 
jobs, Mesos no longer offers resources to it. At the same time, once a framework 
completes its jobs, we make sure its “client allocations” value is updated so 
that it is placed back in the sorted list with a real chance of being offered 
resources again immediately (not starved!).
What is also key is that mem-type resources are ignored during share 
computation, as only cpus are a good indicator of which frameworks are actually 
running jobs in the cluster.
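
To make the last point concrete, here is a minimal sketch of a cpu-only share 
calculation (the function name and resource representation are hypothetical; this 
is not the actual patch):

    #include <map>
    #include <string>

    // Hypothetical resource totals keyed by resource name ("cpus", "mem", ...).
    typedef std::map<std::string, double> ResourceMap;

    // Share computed from cpus only: mem is ignored because long-lived
    // executors hold memory even when idle, so it is a poor signal of which
    // frameworks are actually running work.
    double calculateShare(const ResourceMap& allocated, const ResourceMap& total)
    {
      double allocatedCpus = allocated.count("cpus") ? allocated.at("cpus") : 0.0;
      double totalCpus = total.count("cpus") ? total.at("cpus") : 0.0;

      if (totalCpus <= 0.0) {
        return 0.0;
      }
      return allocatedCpus / totalCpus;
    }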

Thanks,
Claudiu

From: Claudiu Barbura <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, June 12, 2014 at 6:20 PM
To: "[email protected]" <[email protected]>
Subject: Re: Framework Starvation

Hi Vinod,

We have a fix (more like a hack) that works for us, but it requires us to run 
each Hadoop framework with a different role, as we need to treat Hadoop 
differently from the rest of the frameworks (Spark, Shark, Aurora), which are 
running with the default role (*).
We had to change the drf_sorter.cpp/hpp and hierarchical_allocator_process.cpp 
files.

Let me know if you need more info on this.

Thanks,
Claudiu

From: Claudiu Barbura <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Thursday, June 5, 2014 at 2:41 AM
To: "[email protected]" <[email protected]>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log after adding more logging to the sorter code.
I believe the problem lies somewhere else, however … in 
HierarchicalAllocatorProcess<RoleSorter, FrameworkSorter>::allocate().
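
For context, a highly simplified, hypothetical sketch of the kind of loop 
allocate() runs (the sorter interface and names below are assumptions, not the 
actual hierarchical_allocator_process.cpp):

    #include <string>
    #include <vector>

    // Assumed sorter interface: returns clients in DRF order and records
    // allocations so that later sorts reflect them.
    struct Sorter {
      virtual std::vector<std::string> sort() = 0;
      virtual void add(const std::string& client, double cpus) = 0;
      virtual ~Sorter() {}
    };

    // One allocation pass: offer the remaining cpus to frameworks in DRF
    // order. If the sorter's bookkeeping is not updated after an offer, the
    // same zero-share framework wins every pass and the others starve.
    void allocate(Sorter& frameworkSorter, double availableCpus)
    {
      std::vector<std::string> frameworks = frameworkSorter.sort();
      for (size_t i = 0; i < frameworks.size() && availableCpus > 0.0; ++i) {
        double offered = availableCpus;               // offer everything that is left
        // sendOffer(frameworks[i], offered);         // placeholder for the real offer path
        frameworkSorter.add(frameworks[i], offered);  // critical: record the allocation
        availableCpus -= offered;
      }
    }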

I will continue to investigate in the meantime.

Thanks,
Claudiu

From: Vinod Kone <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, June 3, 2014 at 5:16 PM
To: "[email protected]" <[email protected]>
Subject: Re: Framework Starvation

Either should be fine. I don't think there have been any changes in the allocator 
since 0.18.0-rc1.


On Tue, Jun 3, 2014 at 4:08 PM, Claudiu Barbura <[email protected]> wrote:
Hi Vinod,

Should we use the same 0.18.1-rc1 branch or the trunk code?

Thanks,
Claudiu

From: Vinod Kone <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, June 3, 2014 at 3:55 PM
To: "[email protected]" <[email protected]>
Subject: Re: Framework Starvation

Hey Claudiu,

Is it possible for you to run the same test but log more information about 
the framework shares? For example, it would be really insightful if you could log 
each framework's share in DRFSorter::sort() (see: master/drf_sorter.hpp). This 
will help us diagnose the problem. I suspect one of our open tickets around 
allocation (MESOS-1119 <https://issues.apache.org/jira/browse/MESOS-1119>, 
MESOS-1130 <https://issues.apache.org/jira/browse/MESOS-1130> and 
MESOS-1187 <https://issues.apache.org/jira/browse/MESOS-1187>) is the issue. But 
it would be good to have that logging data regardless, to confirm.
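
As a hypothetical illustration of that suggestion (the sort() shape and member 
names below are guesses, not the real master/drf_sorter.hpp):

    #include <glog/logging.h>
    #include <algorithm>
    #include <string>
    #include <vector>

    // Assumed per-client record; names are guesses.
    struct Client { std::string name; double share; unsigned allocations; };

    // Sketch: dump every client's share and allocation count each time the
    // sorted order is produced, so the master log shows why a framework keeps
    // ending up first.
    std::vector<std::string> sortAndLog(std::vector<Client> clients)
    {
      std::sort(clients.begin(), clients.end(),
                [](const Client& a, const Client& b) {
                  return a.share == b.share ? a.allocations < b.allocations
                                            : a.share < b.share;
                });

      std::vector<std::string> order;
      for (size_t i = 0; i < clients.size(); ++i) {
        LOG(INFO) << "DRF order: " << clients[i].name
                  << " share=" << clients[i].share
                  << " allocations=" << clients[i].allocations;
        order.push_back(clients[i].name);
      }
      return order;
    }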


On Mon, Jun 2, 2014 at 10:46 AM, Claudiu Barbura <[email protected]> wrote:
Hi Vinod,

I tried to attach the logs (2MB), but the email (see below) did not go through. 
I emailed your gmail account separately.

Thanks,
Claudiu

From: Claudiu Barbura <[email protected]>
Date: Monday, June 2, 2014 at 10:00 AM
To: "[email protected]" <[email protected]>
Subject: Re: Framework Starvation

Hi Vinod,

I attached the master log snapshots taken during starvation and after starvation.

There are 4 slave nodes and 1 master, all of the same EC2 instance type 
(cc2.8xlarge, 32 cores, 60GB RAM).
I am running 4 shark-cli instances from the same master node and running 
queries on all 4 of them … then “starvation” kicks in (see the attached 
log_during_starvation file).
After I terminate 2 of the shark-cli instances, the starved ones start receiving 
offers and are able to run queries again (see the attached log_after_starvation 
file).

Let me know if you need the slave logs.

Thank you!
Claudiu

From: Vinod Kone <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Friday, May 30, 2014 at 10:13 AM
To: "[email protected]" <[email protected]>
Subject: Re: Framework Starvation

Hey Claudiu,

Mind posting some master logs with the simple setup that you described (3 
shark-cli instances)? That would help us better diagnose the problem.


On Fri, May 30, 2014 at 1:59 AM, Claudiu Barbura <[email protected]> wrote:
This is a critical issue for us, as we have to shut down frameworks for various 
components in our platform to work. This has created more contention than 
before we deployed Mesos, when everyone had to wait in line for their MR/Hive 
jobs to run.

Any guidance, ideas would be extremely helpful at this point.

Thank you,
Claudiu

From: Claudiu Barbura <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, May 27, 2014 at 11:57 PM
To: "[email protected]" <[email protected]>
Subject: Framework Starvation

Hi,

Following Ben’s suggestion at the Seattle Spark Meetup in April, I built and 
deployed the 0.18.1-rc1 branch hoping that this would solve the framework 
starvation problem we have been seeing for the past 2 months. The hope was 
that https://issues.apache.org/jira/browse/MESOS-1086 would also help us. 
Unfortunately it did not.
This bug is preventing us from running multiple Spark and Shark servers (HTTP, 
Thrift) in a load-balanced fashion, alongside Hadoop and Aurora, in the same 
Mesos cluster.

For example, if we start at least 3 frameworks, one Hadoop, one SparkJobServer 
(one Spark context in fine-grained mode) and one HTTP SharkServer (one 
JavaSharkContext, which inherits from SparkContext, again in fine-grained mode), 
and we run queries on all three of them, very soon we notice the following 
behavior:


  *   Only the last two frameworks that we run queries against receive resource 
offers (master.cpp log entries in log/mesos-master.INFO)
  *   The other frameworks are ignored and not allocated any resources until we 
kill one of the two privileged ones above
  *   As soon as one of the privileged frameworks is terminated, one of the 
starved frameworks takes its place
  *   Any new Spark context created in coarse-grained mode (fixed number of 
cores) will generally receive offers immediately (rarely does it get starved)
  *   Hadoop behaves slightly differently when starved: task trackers are 
started but never released, which means that if the first job (Hive query) is 
small in terms of number of input splits, only one task tracker with a small 
number of allocated cores is created, and then all subsequent queries, regardless 
of size, run only in a very limited mode with this one “small” task tracker. Most 
of the time, only the map phase of a big query is completed while the reduce 
phase hangs. Killing one of the registered Spark contexts above releases 
resources for Mesos to complete the query and gracefully shut down the task 
trackers (as noticed in the master log)

We are using the default settings in terms of isolation, weights, etc. … the only 
standout configuration would be the memory allocation for the slave (export 
MESOS_resources=mem:35840 in mesos-slave-env.sh), but I am not sure if this is 
ever enforced, as each framework has its own executor process (a JVM in our case) 
with its own memory allocation (we are not using cgroups yet).

A very easy way to reproduce this bug is to start a minimum of 3 shark-cli 
instances in a Mesos cluster and notice that only two of them are offered 
resources and are running queries successfully.
I spent quite a bit of time in the Mesos, Spark and hadoop-mesos code in an 
attempt to find a possible workaround, but no luck so far.

Any guidance would be very appreciated.

Thank you,
Claudiu





