I of course cannot speak for Jean-Marc; however, my use case is not very corporate. It is a small cluster (9 nodes), and only 1 of those nodes is different (drastically different).
And yes, I configured it so that node has a lot more map slots. However, the problem is that HBase balances without regard to that, and thus even though more map tasks run on that node, they are not data-local! If I have a balancer that is able to keep more regions on that particular node, then the data locality of my map tasks is improved.

On Sat, Dec 8, 2012 at 5:45 PM, Michael Segel <[email protected]> wrote:

> Take what I say with a grain of kosher salt. (It's what they put on your
> drink glasses because the grains are bigger. ;-)
>
> I think what you are doing is a cool hack, however in the bigger picture,
> you shouldn't have to do this with your load balancer. Also it doesn't
> matter if you think about it.
>
> With a heterogeneous cluster, you will not share the same configuration
> across all machines in the cluster. You will change the number of slots per
> node based on its capacity.
> That will limit what amount of work could be done on the same cluster.
>
> You could also consider playing with the rack-aware aspects of your
> cluster.
> You could put all of your 2-CPU machines in the same rack.
>
> In theory... machine, rack, second rack is how the data is distributed.
> In theory, if the 2-CPU machines are neighbors, then the 2nd and/or 3rd copy
> goes to another machine.
>
> Trying to write a custom balancer may be a good hack, but not good in
> terms of corporate life.
>
> Just saying!
>
> -Mike
>
> On Dec 8, 2012, at 1:34 PM, Jean-Marc Spaggiari <[email protected]>
> wrote:
>
> > Hi,
> >
> > It's not yet available anywhere. I will post it today or tomorrow,
> > just the time to remove some hardcoding I did into it ;) It's a quick
> > and dirty PerformanceBalancer. It's not a CPULoadBalancer.
> >
> > Anyway, I will give more details over the week-end, but there is
> > absolutely nothing extraordinary with it.
> >
> > JM
> >
> > 2012/12/8, Robert Dyer <[email protected]>:
> >> I too am interested in this custom load balancer, as I was actually just
> >> starting to look into writing one that does the same thing for
> >> my heterogeneous cluster!
> >>
> >> Is this available somewhere?
> >>
> >> On Sat, Dec 8, 2012 at 9:17 AM, James Chang <[email protected]>
> >> wrote:
> >>
> >>> By the way, I saw you mentioned that you
> >>> have built a "LoadBalancer", could you kindly
> >>> share some detailed info about it?
> >>>
> >>> On Saturday, December 8, 2012, Jean-Marc Spaggiari wrote:
> >>>
> >>>> Hi,
> >>>>
> >>>> Here is the situation.
> >>>>
> >>>> I have a heterogeneous cluster with 2-core, 4-core and 8-core CPU
> >>>> servers. The performance of those different servers allows them to
> >>>> handle different sizes of load. So far, I built a LoadBalancer
> >>>> which balances the regions over those servers based on their
> >>>> performance. And it’s working quite well. The RowCounter went down
> >>>> from 11 minutes to 6 minutes. However, I can still see that some tasks
> >>>> are run on servers accessing data on other servers, which
> >>>> overwhelms the bandwidth and slows down the process, since some 2-core
> >>>> servers are assigned to count some rows hosted on 8-core servers.
> >>>>
> >>>> I’m looking for a way to “force” the tasks to run on the servers where
> >>>> the regions are assigned.
> >>>>
> >>>> I first tried to reject the tasks in the Mapper setup method when the
> >>>> data was not local, to see if the tracker would assign them to another
> >>>> server. No. The task just fails and is mostly not re-assigned. I tried
> >>>> IOExceptions, RuntimeExceptions, InterruptionExceptions with no
> >>>> success.
> >>>>
> >>>> So now I have 3 possible options.
> >>>>
> >>>> The first one is to move from MapReduce to the Coprocessor
> >>>> EndPoint. Running locally on the RegionServer, it’s accessing only the
> >>>> local data and I can manually reject everything that is not local.
> >>>> Therefore it’s achieving my needs, but it’s not my preferred option
> >>>> since I would like to keep the MR features.
> >>>>
> >>>> The second option is to tell Hadoop where the tasks should be
> >>>> assigned. Should that be done by HBase? By Hadoop? I don’t know.
> >>>> Where? I don’t know either. I have started to look at the JobTracker
> >>>> and JobInProgress code but it seems it will be a big task. Also, doing
> >>>> that will mean I will have to re-patch the distributed code each time
> >>>> I upgrade the version, and I will have to redo everything when I
> >>>> move from 1.0.x to 2.x…
> >>>>
> >>>> The third option is to not process the task if the data is not local. I
> >>>> mean, in the map method, simply have an if (!local) return; right at
> >>>> the beginning and just do nothing. This will not work for things like
> >>>> RowCount since all the entries are required, but for some of my
> >>>> use cases this might work, where I don’t necessarily need all the data
> >>>> to be processed. It will not be efficient, though, since the task will
> >>>> still scan the entire region.
> >>>>
> >>>> My preferred option is definitely the 2nd one, but it also seems to
> >>>> be the most difficult one. The third one is very easy to implement.
> >>>> It needs 2 lines to see if the data is local. But it’s not working for
> >>>> all the scenarios, and is more like a dirty fix. The coprocessor option
> >>>> might be doable too since I already have all the code for my MapReduce
> >>>> jobs. So it might be an acceptable option.
> >>>>
> >>>> I’m wondering if anyone has already faced this situation and worked on
> >>>> something, and if not, do you have any other ideas/options to propose,
> >>>> or can someone point me to the right classes to look at to implement
> >>>> solution 2?
> >>>>
> >>>> Thanks,
> >>>>
> >>>> JM
> >>>>
> >>>
> >>
> >> --
> >>
> >> Robert Dyer
> >> [email protected]
> >>
> >
>

--
Robert Dyer
[email protected]
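
For readers following the thread, here is a minimal sketch of the idea behind a capacity-weighted balancer: split the regions across servers in proportion to a per-server weight (for example the core count). This is not Jean-Marc's PerformanceBalancer, which was not posted in this thread, and it does not use the HBase LoadBalancer interface. The class name `WeightedRegionTargets`, the method `targets`, and the server names are all hypothetical, and using core counts as weights is only an assumption.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of HBase: computes how many regions each
// server should hold when regions are shared out in proportion to a weight.
class WeightedRegionTargets {

    /**
     * Splits totalRegions across servers in proportion to their weights.
     * Uses the largest-remainder method so the targets always sum to totalRegions.
     */
    static Map<String, Integer> targets(Map<String, Integer> weights, int totalRegions) {
        int totalWeight = 0;
        for (int w : weights.values()) {
            if (w < 0) {
                throw new IllegalArgumentException("weights must be non-negative");
            }
            totalWeight += w;
        }
        if (totalWeight == 0) {
            throw new IllegalArgumentException("at least one weight must be positive");
        }

        Map<String, Integer> result = new LinkedHashMap<>();
        Map<String, Integer> remainders = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : weights.entrySet()) {
            int product = e.getValue() * totalRegions;
            int share = product / totalWeight;           // whole regions for this server
            result.put(e.getKey(), share);
            remainders.put(e.getKey(), product % totalWeight);
            assigned += share;
        }

        // Hand the leftover regions to the servers with the largest remainders.
        for (int leftover = totalRegions - assigned; leftover > 0; leftover--) {
            String best = null;
            for (String server : remainders.keySet()) {
                if (best == null || remainders.get(server) > remainders.get(best)) {
                    best = server;
                }
            }
            result.merge(best, 1, Integer::sum);
            remainders.put(best, -1);                    // each server gets at most one extra
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> weights = new LinkedHashMap<>();
        weights.put("node-2core", 2);                    // hypothetical server names
        weights.put("node-4core", 4);
        weights.put("node-8core", 8);
        // prints {node-2core=10, node-4core=20, node-8core=40}
        System.out.println(targets(weights, 70));
    }
}
```

A real balancer would also have to decide which regions to move and how to limit churn; this sketch only covers the target counts.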

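
The "2 lines to see if the data is local" from the third option could look roughly like the sketch below. It only compares two host names; in an actual HBase MapReduce job the region host would presumably come from the input split (for example `TableSplit.getRegionLocation()`) and the task host from `InetAddress.getLocalHost().getHostName()`, but that wiring is an assumption and is not shown. The class `LocalityCheck` and its methods are hypothetical names, and the comparison assumes host names rather than IP addresses.

```java
import java.util.Locale;

// Hypothetical helper: decides whether a task runs on the host serving its region.
class LocalityCheck {

    /** Lower-cases a host name and strips surrounding whitespace and a trailing dot. */
    static String normalize(String host) {
        String h = host.trim().toLowerCase(Locale.ROOT);
        return h.endsWith(".") ? h.substring(0, h.length() - 1) : h;
    }

    /** Returns the part of a host name before the first dot. */
    static String firstLabel(String host) {
        int dot = host.indexOf('.');
        return dot < 0 ? host : host.substring(0, dot);
    }

    /**
     * True when both names refer to the same host. If both names are fully
     * qualified they must match exactly; if either is a short name, only the
     * first labels are compared.
     */
    static boolean isLocal(String regionHost, String taskHost) {
        String r = normalize(regionHost);
        String t = normalize(taskHost);
        if (r.contains(".") && t.contains(".")) {
            return r.equals(t);
        }
        return firstLabel(r).equals(firstLabel(t));
    }
}
```

As noted in the original message, returning early from `map()` when `isLocal` is false skips the processing but not the scan: the task still reads the whole region over the network.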