I've looked into this in the past, but I haven't implemented anything yet. A couple of notes:
1) From what I can tell, HBase doesn't currently provide an API you could use
to figure this out smartly. (I was looking at 0.90.x; it could have changed in
later versions.)

2) What seemed to me a good approach was to base the choice on a combination
of oldest modified time and number of store files. I was going to write a
script that iterates over all the regions in HDFS, chooses the region (or up
to N regions) with either the most store files or the files with the oldest
modified timestamp, and runs major_compact on those.

3) At the end of the day, our servers were not utilizing 100% of disk and CPU,
so we decided to just major compact everything each night. We staggered the
compactions over a couple of hours so as not to overwhelm the cluster, though
I'm not sure that has much effect, since compactions run serially in a single
thread anyway.

On Wed, Dec 12, 2012 at 3:19 PM, Otis Gospodnetic <[email protected]> wrote:

> Hi,
>
> If you want to do major compaction on a single region at a time (to
> minimize the impact on the cluster), how do you pick which region to
> compact?
>
> What should one look for in order to get the best ROI out of major
> compaction - the best ratio of the negative impact and positive benefit
> - and is there a programmatic way to get to this information, so region
> selection+compaction can be automated?
>
> Thanks,
> Otis
> --
> HBASE Performance Monitoring - http://sematext.com/spm/index.html
> Search Analytics - http://sematext.com/search-analytics/index.html
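For what it's worth, the selection step from point 2 could be sketched roughly
like this. This is a minimal sketch, not an HBase API: it assumes you have
already walked the region directories in HDFS and collected, per region, the
store-file count and the oldest store-file modification time. The `RegionStats`
type, the `pick_regions` function, and the additive scoring (one point per
store file plus one point per day of age) are all hypothetical choices of mine;
you'd tune the weighting to your cluster.

```python
import time
from typing import List, NamedTuple, Optional

class RegionStats(NamedTuple):
    name: str            # encoded region name (gathered from HDFS; hypothetical field)
    store_files: int     # number of store files across all column families
    oldest_mtime: float  # oldest store-file modification time, epoch seconds

def pick_regions(stats: List[RegionStats], n: int = 1,
                 now: Optional[float] = None) -> List[str]:
    """Pick up to n regions to major-compact, favoring regions that have
    either the most store files or the oldest store files."""
    if now is None:
        now = time.time()

    def score(r: RegionStats) -> float:
        # Hypothetical weighting: each store file counts as much as
        # one day of staleness. Either many files or very old files
        # pushes a region to the top of the list.
        age_days = (now - r.oldest_mtime) / 86400.0
        return r.store_files + age_days

    ranked = sorted(stats, key=score, reverse=True)
    return [r.name for r in ranked[:n]]
```

The returned names would then be fed one at a time to `major_compact` in the
HBase shell, pausing between regions to limit cluster impact.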
