Hi,

It's not yet available anywhere. I will post it today or tomorrow, just the time to remove some hardcoding I did in it ;) It's a quick and dirty PerformanceBalancer. It's not a CPULoadBalancer.
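JM's PerformanceBalancer hadn't been posted at the time of this message, so the following is only a guess at the core idea: distribute region counts across servers in proportion to a per-server performance weight (core count, in the cluster described below). A minimal, HBase-free sketch of just the quota computation — class and server names are hypothetical, and a real balancer would go on to emit RegionPlans from these quotas:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not JM's actual code: split totalRegions across
// servers in proportion to a performance weight (e.g. CPU core count).
public class PerformanceQuota {

    // Returns how many regions each server should host. Leftover regions
    // from the floor division are handed out to the highest-weight servers
    // first, so the quotas always sum to totalRegions.
    public static Map<String, Integer> quotas(Map<String, Integer> weights, int totalRegions) {
        int totalWeight = 0;
        for (int w : weights.values()) totalWeight += w;

        Map<String, Integer> result = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int q = totalRegions * e.getValue() / totalWeight; // floor of the fair share
            result.put(e.getKey(), q);
            assigned += q;
        }

        // Distribute the remainder, biggest servers first.
        List<String> byWeightDesc = new ArrayList<>(weights.keySet());
        byWeightDesc.sort((a, b) -> weights.get(b) - weights.get(a));
        for (int i = 0; assigned < totalRegions; i++, assigned++) {
            String s = byWeightDesc.get(i % byWeightDesc.size());
            result.put(s, result.get(s) + 1);
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> cores = new LinkedHashMap<>();
        cores.put("node-2core", 2);
        cores.put("node-4core", 4);
        cores.put("node-8core", 8);
        System.out.println(quotas(cores, 100));
        // prints {node-2core=14, node-4core=28, node-8core=58}
    }
}
```

In HBase itself, such logic would plug in by implementing the LoadBalancer interface (balanceCluster over the current server-to-regions map) and registering it via hbase.master.loadbalancer.class; the weighting scheme above is just one plausible choice.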
Anyway, I will give more details over the week-end, but there is absolutely nothing extraordinary about it.

JM

2012/12/8, Robert Dyer <[email protected]>:
> I too am interested in this custom load balancer, as I was actually just
> starting to look into writing one that does the same thing for
> my heterogeneous cluster!
>
> Is this available somewhere?
>
> On Sat, Dec 8, 2012 at 9:17 AM, James Chang <[email protected]>
> wrote:
>
>> By the way, I saw you mentioned that you have built a "LoadBalancer";
>> could you kindly share some detailed info about it?
>>
>> Jean-Marc Spaggiari wrote on Saturday, December 8, 2012:
>>
>> > Hi,
>> >
>> > Here is the situation.
>> >
>> > I have a heterogeneous cluster with 2-core, 4-core and 8-core CPU
>> > servers. The performance of these different servers allows them to
>> > handle different sizes of load. So far, I have built a LoadBalancer
>> > which balances the regions over those servers based on their
>> > performance, and it's working quite well: the RowCounter went down
>> > from 11 minutes to 6 minutes. However, I can still see tasks running
>> > on some servers while accessing data on other servers, which
>> > overwhelms the bandwidth and slows down the process, since some
>> > 2-core servers are assigned to count rows hosted on 8-core servers.
>> >
>> > I'm looking for a way to "force" the tasks to run on the servers
>> > where the regions are assigned.
>> >
>> > I first tried to reject the tasks in the Mapper setup method when
>> > the data was not local, to see if the tracker would assign them to
>> > another server. It doesn't: the task just fails and is mostly not
>> > re-assigned. I tried IOExceptions, RuntimeExceptions and
>> > InterruptedExceptions with no success.
>> >
>> > So now I have 3 possible options.
>> >
>> > The first one is to move from MapReduce to a Coprocessor EndPoint.
>> > Running locally on the RegionServer, it accesses only the local
>> > data, and I can manually reject whatever is not local. Therefore it
>> > achieves my needs, but it's not my preferred option since I would
>> > like to keep the MR features.
>> >
>> > The second option is to tell Hadoop where the tasks should be
>> > assigned. Should that be done by HBase? By Hadoop? I don't know.
>> > Where? I don't know either. I have started to look at the JobTracker
>> > and JobInProgress code, but it seems it will be a big task. Also,
>> > doing that will mean I have to re-patch the distributed code each
>> > time I upgrade the version, and I will have to redo everything when
>> > I move from 1.0.x to 2.x…
>> >
>> > The third option is to not process the task if the data is not
>> > local. I mean, in the map method, simply have an "if (!local)
>> > return;" right at the beginning and just do nothing. This will not
>> > work for things like RowCount, since all the entries are required,
>> > but it might work for some of my use cases where I don't necessarily
>> > need all the data to be processed. It will not be efficient, though:
>> > the task will still scan the entire region.
>> >
>> > My preferred option is definitely the 2nd one, but it also seems to
>> > be the most difficult one. The third one is very easy to implement;
>> > it needs 2 lines to see if the data is local. But it doesn't work
>> > for all the scenarios, and is more of a dirty fix. The coprocessor
>> > option might be doable too, since I already have all the code for my
>> > MapReduce jobs, so it might be an acceptable option.
>> >
>> > I'm wondering if anyone has already faced this situation and worked
>> > on something, and if not, do you have any other ideas/options to
>> > propose, or can someone point me to the right classes to look at to
>> > implement solution 2?
>> >
>> > Thanks,
>> >
>> > JM
>
> --
> Robert Dyer
> [email protected]
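The third option in the thread hinges on a small locality check: is the region hosting the current split on this machine? A hedged, self-contained sketch of just that comparison (the surrounding TableMapper boilerplate is omitted, and the hostname normalization is an assumption, since region locations and InetAddress.getLocalHost() may report short versus fully-qualified names):

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical helper for the "if (!local) return;" idea from the thread.
// In a real TableMapper one would call isLocal() once in setup() with the
// split's region location and cache the boolean for the map method.
public class LocalityCheck {

    // Compare hostnames ignoring case and any domain suffix, since the
    // same server may be reported as "node1" or "node1.example.com".
    public static boolean sameHost(String regionHost, String localHost) {
        return shortName(regionHost).equalsIgnoreCase(shortName(localHost));
    }

    private static String shortName(String host) {
        int dot = host.indexOf('.');
        return dot < 0 ? host : host.substring(0, dot);
    }

    public static boolean isLocal(String regionHost) throws UnknownHostException {
        return sameHost(regionHost, InetAddress.getLocalHost().getHostName());
    }
}
```

In the map method this then reduces to the two lines JM mentions — an early `if (!local) return;` before any processing. As the thread notes, the scan still reads the whole region, which is why this remains a dirty fix rather than a real locality guarantee.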
