I wanted to do some more investigation before posting to the list, but it seems relevant to this conversation...
Is it possible that major compactions don't always localize the data
blocks? Our cluster had a bunch of regions full of historical analytics
data that were already major compacted, then we added a new
datanode/regionserver. We have a job that triggers a major compaction of
each region at least once per week by hashing the region name and
assigning it a time slot. It's been several weeks, and the original nodes
each have ~480 GB used in HDFS, while the new node has only 240 GB.
Regions are scattered fairly randomly and evenly among the regionservers.
The job calls hBaseAdmin.majorCompact(hRegionInfo.getRegionName());
(a rough sketch of this job appears below the quoted thread).

My guess is that if a region is already major compacted and no new data
has been added to it, then the compaction is skipped. That's definitely an
essential feature during typical operation, but it's a problem if you're
relying on major compaction to balance the cluster.

Matt

On Thu, May 19, 2011 at 4:42 AM, Michel Segel <[email protected]> wrote:

> I had asked the question about how he created random keys... Hadn't seen a
> response.
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On May 18, 2011, at 11:27 PM, Stack <[email protected]> wrote:
>
> > On Wed, May 18, 2011 at 5:11 PM, Weihua JIANG <[email protected]>
> > wrote:
> >> All the DNs almost have the same number of blocks. Major compaction
> >> makes no difference.
> >>
> >
> > I would expect major compaction to even the number of blocks across
> > the cluster, and it'd move the data for each region local to the
> > regionserver.
> >
> > The only explanation that I can see is that the hot DNs must be
> > carrying the hot blocks (the client queries are not random). I do not
> > know what else it could be.
> >
> > St.Ack
> >
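
For reference, here is a minimal sketch of the kind of weekly compaction
job described above, assuming the 0.90-era client API
(HTable.getRegionsInfo(), HBaseAdmin.majorCompact(byte[])). The class
name, the hourly slot granularity, and the table name passed on the
command line are made up for illustration; only the majorCompact call on
the region name comes from the message itself.

    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HRegionInfo;
    import org.apache.hadoop.hbase.HServerAddress;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.util.Bytes;

    public class WeeklyMajorCompactor {
      // One slot per hour over a week: each region's name hashes into
      // exactly one slot, so every region gets major compacted once a week.
      private static final int SLOTS_PER_WEEK = 7 * 24;

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HBaseAdmin admin = new HBaseAdmin(conf);
        HTable table = new HTable(conf, args[0]);  // table name as first argument

        // The hourly slot we are in right now, within the weekly cycle.
        int currentSlot =
            (int) ((System.currentTimeMillis() / (3600 * 1000L)) % SLOTS_PER_WEEK);

        // Map of region -> hosting server for this table (0.90-era API).
        Map<HRegionInfo, HServerAddress> regions = table.getRegionsInfo();
        for (HRegionInfo hRegionInfo : regions.keySet()) {
          // Hash the region name into a slot (kept non-negative), and only
          // compact the regions whose slot matches the current hour.
          int hash = Bytes.hashCode(hRegionInfo.getRegionName());
          int slot = (hash % SLOTS_PER_WEEK + SLOTS_PER_WEEK) % SLOTS_PER_WEEK;
          if (slot == currentSlot) {
            admin.majorCompact(hRegionInfo.getRegionName());
          }
        }
        table.close();
      }
    }

Run hourly from cron, a job like this spreads the compactions across the
week instead of issuing them all at once; note that majorCompact() only
queues the request, it does not wait for the compaction to finish.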
