Tim,

Answers inline below.
On 4/19/13 1:42 PM, "Tim St Clair" <[email protected]> wrote:

>Robert,
>
>Thank you for your response.
>I've placed some questions and comments inline below.
>
>Cheers,
>Tim
>
>----- Original Message -----
>> From: "Robert Evans" <[email protected]>
>> To: [email protected]
>> Sent: Friday, April 19, 2013 12:34:52 PM
>> Subject: Re: Omega vs. YARN
>>
>> Tim,
>>
>> They are very interesting points. From a scalability point of view I
>> don't think we have really run into those situations yet, but they are
>> coming. YARN currently has some very "simplistic" scheduling in the RM.
>> All of the complexity comes out in the AM. There have been a number of
>> JIRAs to make requests more complex, to help support more "picky"
>> applications like the paper says. These would shift YARN a bit from a
>> two-level scheduler towards a monolithic one, reducing some of the
>> scalability of the system but letting it support more complex scheduling
>> patterns. The largest YARN cluster I know of right now is about 4000
>> nodes. On it we are hitting some bottlenecks with the current scheduler.
>> We have looked at some ways to speed it up with more conventional
>> approaches, like allowing the scheduler to be multithreaded. We expect
>> to be able to easily support 4000-6000 nodes through YARN with a few
>> optimizations. Going to tens of thousands of nodes would require some
>> more significant changes.

>If there are JIRA(s) which outline the limitations I would be interested
>in knowing more.

YARN-397 is kind of a roll-up JIRA for some of the scheduler API enhancements, but there are also YARN-314, YARN-56, YARN-110, and YARN-238. That does not include the one I am most interested in, which is gang scheduling; I just haven't filed a JIRA for that yet.
There is also preemption as an option in YARN-397. It is not strictly part of the scheduling request API, but I believe it includes informing the AM that resources are going to be taken back if it does not release some of them.

>>
>> As far as utilization is concerned the presented architecture does
>> provide some very interesting points, but all of that can be addressed
>> with a monolithic scheduler so long as we don't have to scale very
>> large. It also would probably require a complete redesign of YARN and
>> the MR AM, which is not a small undertaking. There is also the question
>> of trusted code. In a shared-state system where all of the various
>> schedulers are peers, how would we enforce resource constraints?

>I think the biggest open questions I have with a distributed approach
>are: priority, preemption policies, and fragmentation.

Yes, those become more difficult in a distributed environment, but I don't think they are overwhelmingly difficult. These are hard problems to solve for any scheduler, because we are trying to come up with heuristics for a problem that is practically impossible to solve. I am not a mathematician, but I believe that optimally scheduling resources is an NP-hard problem. What is more, we don't know what the resource utilization is going to be up front, despite the users' resource request/hint, so it is an NP-hard problem once we have solved the halting problem. This is where priority and preemption come in: as ways to offload some of the complexity onto the user, and to fix mistakes the heuristic made while scheduling.

Moving to a distributed environment, you can solve this in a number of ways. The paper talks about optimistic scheduling vs. pessimistic scheduling. With optimistic scheduling, each scheduler acts kind of like it is the only one there is, and then cleans things up afterwards if there is a collision. With pessimistic scheduling, it works hard to avoid all collisions, most likely through locking. The paper also talked about auditing the schedulers after the fact to detect whether any of them are doing something that does not fit the policy, instead of trying to enforce it. If we wanted to go with enforcement, you could have specific schedulers with priority over other schedulers, so that by convention lower-priority schedulers could not preempt higher-priority ones; but then resource utilization, in theory, would go down.

On fragmentation, I don't think any of the Hadoop schedulers right now try to do anything about it, except pretend it does not exist. In fact, we have seen a very rare livelock situation where the MR AM thinks there is enough headroom to schedule a map task, so it does not bother to shoot a reducer; but because the headroom is fragmented across various machines, the map task will never actually be scheduled.

>> Each of the schedulers would have
>> to enforce them themselves, and as such would have to be trusted code.
>> This makes adding in new application types on the fly difficult.
>>
>> I suppose we could do a hybrid approach, where the RM is a single type
>> of scheduler among many. It would provide the same API that currently
>> exists for YARN applications, but MR applications could have one or more
>> "JobTracker"-like schedulers that share state with the RM, and whatever
>> other "schedulers" there are out there. That would be something fun to
>> try out, but sadly I really don't have time to even get started thinking
>> about a proof of concept on something like that. At least that is until
>> we hit a significant business use case that would drive it over the
>> architecture we already have.
>>
>> For example, needing 10s of thousands of nodes in a
>> cluster, or a huge shift of different types of jobs on to YARN so that
>> we are doing a lot more than just MR on the same cluster.
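To make the headroom-fragmentation livelock I mentioned concrete, here is a toy sketch. All the numbers and names are invented for illustration; this is not real YARN code, just the shape of the problem: the AM reasons about an aggregate headroom number, while actual placement needs a single node that fits.

```python
# Toy illustration of the fragmented-headroom livelock (invented numbers,
# not YARN code): aggregate headroom says "yes", per-node placement says "no".

def total_headroom(nodes):
    """Headroom as the AM sees it: one aggregate number across the cluster."""
    return sum(nodes.values())

def can_place(nodes, request_mb):
    """What actually matters: does any single node fit the whole container?"""
    return any(free >= request_mb for free in nodes.values())

# 3 GB of total headroom, but fragmented as 1 GB per node.
free_mb_per_node = {"node1": 1024, "node2": 1024, "node3": 1024}
map_request_mb = 2048

# The AM sees enough total headroom, so it never shoots a reducer...
assert total_headroom(free_mb_per_node) >= map_request_mb
# ...but no single node can host the map container, so it waits forever.
assert not can_place(free_mb_per_node, map_request_mb)
```

The fix the AM would need is visibility into per-node (or per-rack) headroom rather than a single aggregate, which is exactly the kind of richer scheduler API the JIRAs above are circling around.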
>Something tells me it may come fast, if/when the YARN application space
>expands.

I agree that it may come fast. I just don't know if the mix of applications that Google has will actually come to Hadoop. I can see a lot of batch-processing applications being run on top of YARN, because even though YARN is generic it makes a lot of batch-processing assumptions. Because of this, I just don't know about other types of processing. It is a bit of a chicken/egg problem.

I am in the process of doing a basic port of Storm to run on top of YARN. Because the resource scheduling/isolation is not that great on YARN (Mesos too, for that matter), we request entire nodes from YARN and bring up a predefined number of machines instead of letting the cluster grow and shrink on demand, because we need to be sure we can get the resources when we need them. Honestly, it is a lot simpler to do what we are doing through OpenStack or some other VM management system than through YARN.

Even looking at tools like Impala or HBase, it would be very difficult with their current architecture to think about using YARN for scheduling/deployment. For example, when security is enabled, HDFS sets limits on how long a delegation token is good for. Once the two weeks (configurable) are up, your HDFS delegation token is done and no new containers will be able to be launched, because the distributed cache will no longer be able to download your jars. I just see a lot of difficulty in using YARN for long-running processes, and it being a lot simpler to use something else like OpenStack for that.

Now with that said, if we could somehow have a distributed scheduler with both Hadoop and OpenStack sharing the same large cluster, that would be awesome. But again, I have to convince my management that it is the right way to go and that it will save us X million dollars a year, and I have to convince myself that it is worth spending my time on that instead of the other fun stuff I have been doing. :)

>>
>> --Bobby
>>
>> On 4/19/13 9:47 AM, "Tim St Clair" <[email protected]> wrote:
>>
>> >I recently read Google's Omega paper, and am wondering if any of the
>> >YARN developers were planning to address some of the items considered
>> >as key points.
>> >
>> >http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
>> >
>> >Cheers,
>> >Tim
>>
>>
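P.S. For anyone curious what the optimistic shared-state approach from the paper looks like in miniature, here is a rough sketch. Every name here is hypothetical; this is not Omega or YARN code, just the commit/conflict/retry shape: each scheduler reads a private snapshot of the shared cell state, picks a placement as if it were alone, and a version check at commit time catches collisions so the loser retries.

```python
# Toy sketch of optimistic shared-state scheduling (all names invented):
# peer schedulers race to commit claims against a versioned cell state.

class CellState:
    """Shared cluster view: node -> free memory (MB), plus a commit version."""
    def __init__(self, free_mb):
        self.free_mb = dict(free_mb)
        self.version = 0

    def commit(self, read_version, claims):
        """Atomically apply claims {node: mb} only if state is unchanged."""
        if read_version != self.version:
            return False                      # another scheduler committed first
        if any(self.free_mb[n] < mb for n, mb in claims.items()):
            return False                      # our claim no longer fits
        for n, mb in claims.items():
            self.free_mb[n] -= mb
        self.version += 1
        return True

def schedule(cell, request_mb):
    """One optimistic scheduler: act as if alone, retry on collision."""
    while True:
        read_version = cell.version           # take a private snapshot
        fits = [n for n, mb in cell.free_mb.items() if mb >= request_mb]
        if not fits:
            return None                       # nothing fits anywhere; give up
        if cell.commit(read_version, {fits[0]: request_mb}):
            return fits[0]                    # our claim won the race

cell = CellState({"node1": 4096, "node2": 2048})
print(schedule(cell, 2048))
```

A pessimistic version would instead hold a lock over the cell (or a partition of it) for the whole decision, which is exactly the serialization the paper is trying to avoid.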
