When a table is created with N regions, is it possible to distribute them (almost) equally among the region servers?
Thanks

On Thu, Dec 2, 2010 at 3:10 PM, Jonathan Gray <[email protected]> wrote:

> Yeah, I'd recommend just using the normal TIF which will have a map task
> per region, attempts to schedule it on that node, and each task would talk
> to only one (hopefully local) server.
>
> As for assignment, the story has changed significantly between previous
> versions and the upcoming 0.90 release.
>
> In 0.90, there are two modes of startup assignment. The new default is
> 'retain assignment' where the master will attempt to reuse whatever the last
> set of assignments were on the previous run of the cluster. The other
> option, if you turn off retain assignment, is round-robin. This round-robin
> assignment would give you what you want (an approximately equal number of
> regions of each table on each server).
>
> What I've done to get good distribution of the tables is startup with
> round-robin, then from then on use retain assignment.
>
> JG
>
> > -----Original Message-----
> > From: Sean Sechrist [mailto:[email protected]]
> > Sent: Thursday, December 02, 2010 2:50 PM
> > To: [email protected]
> > Subject: Re: region, regionserver questions
> >
> > Hey Albert,
> >
> > If you use TableInputFormat, it will create one map task per region in
> > that table. So, each mapper should just talk to one regionserver.
> >
> > -Sean
> >
> > On Thu, Dec 2, 2010 at 5:26 PM, Albert Shau <[email protected]> wrote:
> >
> > > Hi,
> > >
> > > I'm doing a distributed scan of an hbase table using map-reduce by
> > > taking all the regions belonging to a regionserver, and then assigning
> > > those regions to a mapper (so there's 1 mapper per regionserver, and
> > > each mapper only talks to one regionserver). However, doing it this
> > > way I'm getting some data skew. For example, I have 2 tables U and T.
> > > Each regionserver may have 30 regions, but one regionserver might have
> > > 10 regions from table U while another regionserver might have 25
> > > regions from table U. Is there a way to balance regions per table per
> > > regionserver (so that each regionserver has 15 regions from table U
> > > for example)? Or should I just not worry about trying to have each
> > > individual mapper only talk to one regionserver?
> > >
> > > Also, how do regions get assigned to regionservers? Is it based on
> > > data locality? Region start/end keys? Randomly?
> > >
> > > Thanks,
> > > Albert
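
For anyone reading this thread later, here is a rough illustration of why round-robin gives an approximately equal number of a table's regions on each server. This is only a sketch of the idea, not the HBase master's actual assignment code: the class name `RoundRobinSketch`, the method `assign`, and the region and server names are all made up for the example, and it handles the regions of a single table.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; not part of HBase.
class RoundRobinSketch {

    // Deal the regions of one table out to the servers in turn, so that
    // per-server counts for that table differ by at most one.
    static Map<String, List<String>> assign(List<String> regions, List<String> servers) {
        if (servers.isEmpty()) {
            throw new IllegalArgumentException("need at least one server");
        }
        Map<String, List<String>> plan = new LinkedHashMap<>();
        for (String server : servers) {
            plan.put(server, new ArrayList<>());
        }
        for (int i = 0; i < regions.size(); i++) {
            String server = servers.get(i % servers.size());
            plan.get(server).add(regions.get(i));
        }
        return plan;
    }

    public static void main(String[] args) {
        // Made-up names: 7 regions of one table, 3 region servers.
        List<String> regions = Arrays.asList("r0", "r1", "r2", "r3", "r4", "r5", "r6");
        List<String> servers = Arrays.asList("rs1", "rs2", "rs3");
        // prints {rs1=[r0, r3, r6], rs2=[r1, r4], rs3=[r2, r5]}
        System.out.println(assign(regions, servers));
    }
}
```

With N regions and S servers, each server ends up with either floor(N/S) or ceil(N/S) regions of the table. Doing this once per table is what keeps a table like U from piling up on a few servers.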
