Tim,

Answers inline below.
On 4/19/13 1:42 PM, "Tim St Clair" <[email protected]> wrote:

>Robert,
>
>Thank you for your response.
>I've placed some questions and comments inline below.
>
>Cheers,
>Tim
>
>----- Original Message -----
>> From: "Robert Evans" <[email protected]>
>> To: [email protected]
>> Sent: Friday, April 19, 2013 12:34:52 PM
>> Subject: Re: Omega vs. YARN
>>
>> Tim,
>>
>> They are very interesting points. From a scalability point of view I
>> don't think we have really run into those situations yet, but they are
>> coming. YARN currently has some very "simplistic" scheduling in the RM.
>> All of the complexity comes out in the AM. There have been a number of
>> JIRAs to make requests more complex, to help support more "picky"
>> applications like the paper says. These would shift YARN a bit from a
>> two-level scheduler towards a monolithic one, reducing some of the
>> scalability of the system but letting it support more complex scheduling
>> patterns. The largest YARN cluster I know of right now is about 4000
>> nodes. On it we are hitting some bottlenecks with the current scheduler.
>> We have looked at some ways to speed it up with more conventional
>> approaches, like allowing the scheduler to be multithreaded. We expect
>> to be able to easily support 4000-6000 nodes through YARN with a few
>> optimizations. Going to tens of thousands of nodes would require some
>> more significant changes.

>If there are JIRA(s) which outline the limitations I would be interested
>in knowing more.

YARN-397 is kind of a roll-up JIRA for some of the scheduler API enhancements, but there are also YARN-314, YARN-56, YARN-110, and YARN-238. That does not include the one I am most interested in, which is gang scheduling; I just haven't filed a JIRA for that yet.
There is also preemption as an option in YARN-397. It is not strictly part of the scheduling request API, but I believe it includes informing the AM that resources are going to be taken back if it does not release some of them.

>>
>> As far as utilization is concerned the presented architecture does
>> provide some very interesting points, but all of that can be addressed
>> with a monolithic scheduler so long as we don't have to scale very
>> large. It also would probably require a complete redesign of YARN and
>> the MR AM, which is not a small undertaking. There is also the question
>> of trusted code. In a shared-state system where all of the various
>> schedulers are peers, how would we enforce resource constraints?

>I think the biggest open questions I have with a distributed approach
>are: priority, preemption policies, and fragmentation.

Yes, those become more difficult in a distributed environment, but I don't think they are overwhelmingly difficult. These are hard problems to solve for any scheduler, because we are trying to come up with heuristics for a problem that is practically impossible to solve. I am not a mathematician, but I believe that optimally scheduling resources is an NP-hard problem. What is more, we don't know what the resource utilization is going to be up front, despite the users' resource request/hint, so it is an NP-hard problem once we have solved the halting problem. This is where priority and preemption come in: as ways to offload some of the complexity onto the user, and to fix mistakes the heuristic made while scheduling.

Moving to a distributed environment, you can solve this in a number of ways. The paper talks about optimistic scheduling vs. pessimistic scheduling. With optimistic scheduling, each scheduler acts kind of like it is the only one there is, and then cleans things up afterwards if there is a collision. With pessimistic scheduling, it works hard to avoid all collisions, most likely through locking. The paper also talked about auditing the schedulers after the fact to detect whether any of them are doing something that does not fit the policy, instead of trying to enforce it. If we wanted to go with enforcement, you could have specific schedulers with priority over other schedulers, so that by convention lower-priority schedulers could not preempt higher-priority ones; but then resource utilization, in theory, would go down.

On fragmentation, I don't think any of the Hadoop schedulers right now try to do anything about it, except pretend it does not exist. In fact, we have seen a very rare livelock situation where the MR AM thinks there is enough headroom to schedule a map task, so it does not bother to shoot a reducer; but because the headroom is fragmented across various machines, the map task will never actually be scheduled.

>> Each of the schedulers would have
>> to enforce them themselves, and as such would have to be trusted code.
>> This makes adding in new application types on the fly difficult.
>>
>> I suppose we could do a hybrid approach, where the RM is a single type
>> of scheduler among many. It would provide the same API that currently
>> exists for YARN applications, but MR applications could have one or more
>> "JobTracker"-like schedulers that share state with the RM, and whatever
>> other "schedulers" there are out there. That would be something fun to
>> try out, but sadly I really don't have time to even get started thinking
>> about a proof of concept on something like that. At least that is until
>> we hit a significant business use case that would drive it over the
>> architecture we already have.
>>
>> For example, needing 10s of thousands of nodes in a
>> cluster, or a huge shift of different types of jobs on to YARN so that
>> we are doing a lot more than just MR on the same cluster.
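To make the headroom-fragmentation livelock I mentioned concrete, here is a toy sketch. All the numbers and names are invented for illustration; this is not real YARN code, just the shape of the problem: the AM reasons about an aggregate headroom number, while actual placement needs a single node that fits.

```python
# Toy illustration of the fragmented-headroom livelock (invented numbers,
# not YARN code): aggregate headroom says "yes", per-node placement says "no".

def total_headroom(nodes):
    """Headroom as the AM sees it: one aggregate number across the cluster."""
    return sum(nodes.values())

def can_place(nodes, request_mb):
    """What actually matters: does any single node fit the whole container?"""
    return any(free >= request_mb for free in nodes.values())

# 3 GB of total headroom, but fragmented as 1 GB per node.
free_mb_per_node = {"node1": 1024, "node2": 1024, "node3": 1024}
map_request_mb = 2048

# The AM sees enough total headroom, so it never shoots a reducer...
assert total_headroom(free_mb_per_node) >= map_request_mb
# ...but no single node can host the map container, so it waits forever.
assert not can_place(free_mb_per_node, map_request_mb)
```

The fix the AM would need is visibility into per-node (or per-rack) headroom rather than a single aggregate, which is exactly the kind of richer scheduler API the JIRAs above are circling around.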
>Something tells me it may come fast, if/when the YARN application space
>expands.

I agree that it may come fast. I just don't know if the mix of applications that Google has will actually come to Hadoop. I can see a lot of batch-processing applications being run on top of YARN, because even though YARN is generic it makes a lot of batch-processing assumptions. Because of this, I just don't know about other types of processing. It is a bit of a chicken/egg problem.

I am in the process of doing a basic port of Storm to run on top of YARN. Because the resource scheduling/isolation is not that great on YARN (Mesos too, for that matter), we request entire nodes from YARN and bring up a predefined number of machines instead of letting the cluster grow and shrink on demand, because we need to be sure we can get the resources when we need them. Honestly, it is a lot simpler to do what we are doing through OpenStack or some other VM management system than through YARN.

Even looking at tools like Impala or HBase, it would be very difficult with their current architecture to think about using YARN for scheduling/deployment. For example, when security is enabled, HDFS sets limits on how long a delegation token is good for. Once the two weeks (configurable) are up, your HDFS delegation token is done and no new containers will be able to be launched, because the distributed cache will no longer be able to download your jars. I just see a lot of difficulty in using YARN for long-running processes, and it being a lot simpler to use something else like OpenStack for that.

Now with that said, if we could somehow have a distributed scheduler with both Hadoop and OpenStack sharing the same large cluster, that would be awesome. But again, I have to convince my management that it is the right way to go and that it will save us X million dollars a year, and I have to convince myself that it is worth spending my time on that instead of the other fun stuff I have been doing. :)

>>
>> --Bobby
>>
>> On 4/19/13 9:47 AM, "Tim St Clair" <[email protected]> wrote:
>>
>> >I recently read Google's Omega paper, and am wondering if any of the
>> >YARN developers were planning to address some of the items considered
>> >as key points.
>> >
>> >http://eurosys2013.tudos.org/wp-content/uploads/2013/paper/Schwarzkopf.pdf
>> >
>> >Cheers,
>> >Tim
>>
>>
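P.S. For anyone curious what the optimistic shared-state approach from the paper looks like in miniature, here is a rough sketch. Every name here is hypothetical; this is not Omega or YARN code, just the commit/conflict/retry shape: each scheduler reads a private snapshot of the shared cell state, picks a placement as if it were alone, and a version check at commit time catches collisions so the loser retries.

```python
# Toy sketch of optimistic shared-state scheduling (all names invented):
# peer schedulers race to commit claims against a versioned cell state.

class CellState:
    """Shared cluster view: node -> free memory (MB), plus a commit version."""
    def __init__(self, free_mb):
        self.free_mb = dict(free_mb)
        self.version = 0

    def commit(self, read_version, claims):
        """Atomically apply claims {node: mb} only if state is unchanged."""
        if read_version != self.version:
            return False                      # another scheduler committed first
        if any(self.free_mb[n] < mb for n, mb in claims.items()):
            return False                      # our claim no longer fits
        for n, mb in claims.items():
            self.free_mb[n] -= mb
        self.version += 1
        return True

def schedule(cell, request_mb):
    """One optimistic scheduler: act as if alone, retry on collision."""
    while True:
        read_version = cell.version           # take a private snapshot
        fits = [n for n, mb in cell.free_mb.items() if mb >= request_mb]
        if not fits:
            return None                       # nothing fits anywhere; give up
        if cell.commit(read_version, {fits[0]: request_mb}):
            return fits[0]                    # our claim won the race

cell = CellState({"node1": 4096, "node2": 2048})
print(schedule(cell, 2048))
```

A pessimistic version would instead hold a lock over the cell (or a partition of it) for the whole decision, which is exactly the serialization the paper is trying to avoid.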
