Hey Albert,

If you use TableInputFormat, it will create one map task per region in that table. So, each mapper should just talk to one regionserver.
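
To make the difference concrete, here is a rough sketch in plain Python (not HBase code) that compares the two ways of cutting up the work, using the table U numbers from your mail. The server names `rs1`/`rs2` and the helper functions are made up for illustration, and it assumes all regions are roughly the same size, which may not hold on a real cluster.

```python
# Hypothetical layout using the numbers from the question:
# regionserver name -> number of regions of table U hosted there.
regions_of_u = {"rs1": 10, "rs2": 25}


def mapper_per_regionserver(layout):
    """One mapper per regionserver: each mapper scans every region on its server."""
    return [{"server": rs, "regions": n} for rs, n in layout.items()]


def mapper_per_region(layout):
    """One mapper per region: each mapper scans exactly one region on one server."""
    return [{"server": rs, "regions": 1} for rs, n in layout.items() for _ in range(n)]


def max_load(tasks):
    """Largest number of regions any single mapper has to scan."""
    return max(t["regions"] for t in tasks)


per_server = mapper_per_regionserver(regions_of_u)
per_region = mapper_per_region(regions_of_u)

print(len(per_server), max_load(per_server))  # 2 mappers, slowest one scans 25 regions
print(len(per_region), max_load(per_region))  # 35 mappers, each scans 1 region
```

With one mapper per region, every task still talks to a single regionserver, but an uneven region count per server no longer turns into one oversized task.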

-Sean

On Thu, Dec 2, 2010 at 5:26 PM, Albert Shau <[email protected]> wrote:
> Hi,
>
> I'm doing a distributed scan of an hbase table using map-reduce by taking
> all the regions belonging to a regionserver, and then assigning those
> regions to a mapper (so there's 1 mapper per regionserver, and each mapper
> only talks to one regionserver). However, doing it this way I'm getting
> some data skew. For example, I have 2 tables U and T. Each regionserver
> may have 30 regions, but one regionserver might have 10 regions from table U
> while another regionserver might have 25 regions from table U. Is there a
> way to balance regions per table per regionserver (so that each regionserver
> has 15 regions from table U for example)? Or should I just not worry about
> trying to have each individual mapper only talk to one regionserver?
>
> Also, how do regions get assigned to regionservers? Is it based on data
> locality? Region start/end keys? Randomly?
>
> Thanks,
> Albert
>
