Heterogeneous cluster

Jean-Marc Spaggiari Fri, 07 Dec 2012 19:33:23 -0800

Hi,

Here is the situation.


I have a heterogeneous cluster with 2-core, 4-core and 8-core
servers. Because of the difference in performance, these servers can
handle different amounts of load. So far, I have built a LoadBalancer
which balances the regions over those servers based on their
performance, and it’s working quite well: the RowCounter went down
from 11 minutes to 6 minutes. However, I can still see that some tasks
run on one server while accessing data on another, which overwhelms
the bandwidth and slows down the process, since some 2-core servers
are assigned to count rows hosted on 8-core servers.
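
To make the idea of balancing regions by server performance concrete,
here is a minimal sketch. It is not the LoadBalancer described above
and does not use the HBase API. It assumes that the core count is a
good enough proxy for performance, and the class and method names
(CoreWeightedAllocation, allocate) are hypothetical. It only computes
how many regions each server should hold, splitting them in proportion
to the cores and handing any leftover regions to the servers with the
largest fractional share.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper for illustration only: computes a target region
// count per server, proportional to that server's number of cores.
class CoreWeightedAllocation {

    static Map<String, Integer> allocate(Map<String, Integer> coresByServer,
                                         int totalRegions) {
        if (totalRegions < 0) {
            throw new IllegalArgumentException("totalRegions must be >= 0");
        }
        long totalCores = 0;
        for (int cores : coresByServer.values()) {
            totalCores += cores;
        }
        if (totalCores <= 0) {
            throw new IllegalArgumentException("need at least one core");
        }

        Map<String, Integer> targets = new LinkedHashMap<>();
        Map<String, Long> remainders = new LinkedHashMap<>();
        int assigned = 0;
        for (Map.Entry<String, Integer> e : coresByServer.entrySet()) {
            // Integer part of totalRegions * cores / totalCores.
            long exact = (long) totalRegions * e.getValue();
            int share = (int) (exact / totalCores);
            targets.put(e.getKey(), share);
            remainders.put(e.getKey(), exact % totalCores);
            assigned += share;
        }

        // Regions lost to rounding go to the largest remainders first.
        // The sort is stable, so ties keep the input order.
        List<String> order = new ArrayList<>(coresByServer.keySet());
        order.sort((a, b) -> Long.compare(remainders.get(b), remainders.get(a)));
        int leftover = totalRegions - assigned;
        for (int i = 0; i < leftover; i++) {
            targets.merge(order.get(i), 1, Integer::sum);
        }
        return targets;
    }
}
```

With one 2-core, one 4-core and one 8-core server and 14 regions, this
gives 2, 4 and 8 regions respectively. A real balancer would also have
to decide which regions to move, which this sketch leaves out.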

I’m looking for a way to “force” the tasks to run on the servers where
the regions are assigned.

I first tried to reject the tasks in the Mapper setup method when the
data was not local, to see if the tracker would assign them to another
server. No. The task just fails and is mostly not re-assigned. I tried
IOExceptions, RuntimeExceptions and InterruptedExceptions with no
success.

So now I have 3 possible options.

The first one is to move from MapReduce to a Coprocessor Endpoint.
Running locally on the RegionServer, it accesses only the local data,
and I can manually reject everything that is not local. Therefore it
meets my needs, but it’s not my preferred option since I would like to
keep the MR features.

The second option is to tell Hadoop where the tasks should be
assigned. Should that be done by HBase? By Hadoop? I don’t know.
Where? I don’t know either. I have started to look at JobTracker and
JobInProgress code but it seems it will be a big task. Also, doing
that would mean I have to re-patch the distributed code each time I
upgrade the version, and I would have to redo everything when I move
from 1.0.x to 2.x…

The third option is to not process the task if the data is not local.
I mean, in the map method, simply have an if (!local) return; right at
the beginning and just do nothing. This will not work for things like
RowCount since all the entries are required, but it might work for
some of my use cases where I don’t necessarily need all the data to be
processed. It will not be efficient, though, since the task will still
scan the entire region.
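
As an illustration of the locality test behind if (!local) return;,
here is a minimal sketch in plain Java. The class and method names
(SplitLocality, isLocal, normalize) are hypothetical. It assumes that
inside a mapper the candidate hosts would come from
context.getInputSplit().getLocations(), and that the task host and the
split locations use the same naming scheme (both short names or both
fully qualified). If they do not, the comparison reports "not local"
even when the data is local.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import java.util.Locale;

// Hypothetical helper for illustration only: decides whether the host
// running a task is one of the hosts reported for its input split.
class SplitLocality {

    // Lower-cases the name and drops a single trailing dot, so that
    // "Node1.example.com." and "node1.example.com" compare equal.
    static String normalize(String host) {
        String h = host.trim().toLowerCase(Locale.ROOT);
        return h.endsWith(".") ? h.substring(0, h.length() - 1) : h;
    }

    // True if localHost matches any of the split's reported locations.
    static boolean isLocal(String localHost, String[] splitLocations) {
        if (localHost == null || splitLocations == null) {
            return false;
        }
        String me = normalize(localHost);
        for (String location : splitLocations) {
            if (location != null && me.equals(normalize(location))) {
                return true;
            }
        }
        return false;
    }

    // Name of the machine this JVM runs on, as resolved by the JDK.
    static String localHostName() throws UnknownHostException {
        return InetAddress.getLocalHost().getCanonicalHostName();
    }
}
```

In a mapper, the result of isLocal(localHostName(), locations) could be
computed once in setup and stored in a field, so that map only checks
a boolean. The scan itself still happens either way.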

My preferred option is definitely the 2nd one, but it also seems to be
the most difficult one. The third one is very easy to implement: it
needs 2 lines to see if the data is local. But it doesn’t work for all
the scenarios, and is more of a dirty fix. The coprocessor option
might be doable too since I already have all the code for my MapReduce
jobs. So it might be an acceptable option.

I’m wondering if anyone has already faced this situation and worked on
something and, if not, whether you have any other ideas/options to
propose, or whether someone can point me to the right classes to look
at to implement option 2?

Thanks,

JM
